The No. 2 Electronic Switching System has been planned with
dependability, maintainability, and operational ease representing a major
portion of the total system design concept. The major characteristics that
fall under these headings are redundancy, trouble detection, trouble
recovery, diagnosis, repair, man-machine interface, data protection, and
operational and administrative procedures for growth and change. All
these factors play a large role in both the hardware and software portions
of the system design. This article reviews the highlights in each of these
areas and describes how they interact in the overall system across both the
control complex and peripheral system areas.

I. OBJECTIVES 

A successful local switching system must be dependable as seen 
from the customer's viewpoint and economical to maintain and 
operate as seen from the telephone company's view. This perform- 
ance must be achieved under the stress of a variety of central office 
environmental constraints: physical (temperature, humidity),
electrical (noise, commercial power interruption), traffic (overload),
office growth (addition of lines, trunks, networks, stores, and so on),
and human errors. The design of a switching system such as No. 2 
ESS must then take into account all of these factors in achieving 
the proper balance between equipment first costs and annual opera- 
tion and maintenance costs, while also meeting customer service 
standards. 

1.1 Dependability

The customer measures a system's dependability in service continuity
and accuracy of call handling (dialing, billing, and so on).
The design objectives are, therefore, continuous high quality service
24 hours a day for a 40-year life. Specific reliability objectives are
that total system down-time (time during which customer service is
noticeably degraded) should not exceed two hours during its 40-year
life and that, on the average, not more than 0.01 percent of calls
should be handled incorrectly as a result of system troubles and errors.
Furthermore, from a customer's service viewpoint, under trouble
conditions it is preferred that service degradation peaks be avoided:
a few calls handled incorrectly occasionally or a few very brief outages
per year cause less customer inconvenience than less frequent but longer
duration service difficulties and interruptions.
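
To put the down-time objective in more familiar terms, the short calculation below converts two hours in 40 years into an annual allowance and a fractional unavailability. It is only an illustrative sketch; the function name and the 365.25-day average year are assumptions, not part of the original design documents.

```python
# Assumption: an average year of 365.25 days (leap days included).
HOURS_PER_YEAR = 365.25 * 24


def downtime_budget(total_downtime_hours, service_life_years):
    """Return (minutes of down-time allowed per year, fractional unavailability)."""
    minutes_per_year = total_downtime_hours * 60 / service_life_years
    unavailability = total_downtime_hours / (service_life_years * HOURS_PER_YEAR)
    return minutes_per_year, unavailability


# The stated objective: at most two hours of down-time in a 40-year life.
minutes_per_year, unavailability = downtime_budget(2, 40)
print(f"{minutes_per_year:.1f} minutes per year, unavailability {unavailability:.2e}")
```

Under these assumptions the objective amounts to about three minutes of degraded service per year, which is why a few very brief outages are tolerable while a single long outage is not.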

In order to achieve this degree of dependability, attention must be 
paid both to the operation and administration of the system when it 
is functioning normally, and to its maintainability in the face of 
trouble. 

1.2 Operation and Administration 

While processing calls, the switching machine must also monitor its
own performance through traffic measurements of calls handled in
service and of calls aborted because of shortages of switching or
trunk facilities. In addition, plant measurements of aborted calls and
associated trouble conditions are necessary to provide an indication 
of the machine's health. This data, in hard copy, direct-reading simple 
formats, provides the basis of much of the craftsmen's actions in the 
day-to-day operation of the system. 

In a stored program switching system, reassignments and additions 
of trunks, for example, may involve hardware changes, translation 
memory changes, and updating of administrative forms. Other classes 
of change items include customer line assignments (frequent but 
relatively simple) and the addition of major growth items such as 
networks and stores (less frequent but more complex). Experience 
has shown the importance of carefully human engineering these 
procedures since human intervention is often the source of system 
trouble rather than conventional component failures. 

1.3 Maintainability 

The actual process of trouble detection, system recovery, diagnosis,
repair, and service verification is broadly classed in the
maintainability category. In order to achieve the required
dependability, the design must stress:

(i) Use of long-life components and adequate circuit margins. 




(ii) A redundancy plan sufficient to keep the system operational 
in the presence of component failures and human intervention for 
administration and growth. 

(iii) Trouble detection mechanisms (hardware and software) for
all service-affecting parts of the system, either automatic, routine, or,
at worst, manual.

(iv) Rapid recovery of the call processing function and protec- 
tion of the calls in progress in the face of either hardware, software, 
or human intervention (procedural) difficulties. 

(v) Diagnosis and isolation of troubles (both hard faults and 
transients) either by automatic (programmed) machine analysis of 
data or by manual means. 

(vi) Repair and verification procedures which allow for requested 
repetitive testing of trouble items, simple mechanical procedures, and 
automatic recheck on service restoral. 

(vii) In general, a high level of human engineering in all displays,
controls, teletypewriter communications, and operational procedures in
which the craftsman interacts with the machine, hopefully with a
view to keeping these items as similar as possible between different
switching systems.

1.4 Centralized Maintenance and Administration Facilities 

The inherent reliability of solid state components and good circuit
design result in a significantly lower trouble rate than that experienced
by comparable electromechanical systems; however, the complexity,
the high speed of operation, and the loss of the ability to observe
apparatus operation audibly and visually have tended to increase the
skills required by plant craftsmen to restore system operation and
locate those troubles which the machine cannot handle automatically.
In addition, the low trouble rate experienced does not give craftsmen
sufficient practice to maintain their proficiency.

To overcome these possible problems and to take advantage of 
the great potential for annual cost savings, facilities and procedures 
should be provided to allow remote administration and maintenance 
control. Such facilities allow a small group of well-trained craftsmen 
at a centralized point to provide administrative and maintenance 
coverage to a number of ESS type offices almost equal to their being 
physically in each office. The amount of trouble hunting experience, at 
this point, should assure continued craftsman proficiency. Many actual
repair operations, running of cross connections, and other tasks
requiring physical contact with the equipment may be taken care of
by semiskilled craftsmen, either dispatched from a central point or, 
if the work load is adequate, by assigning such craftsmen to the office 
on a schedule. 

The centralization of such facilities can also allow specialization of 
functions in convenient locations. For example, the plant service 
center would have direct access to a teletypewriter for making custo- 
mer line change translations to any of several offices. And the traffic 
department could have access to data from several offices at one 
central location. 

In summary then, the objectives of dependability imply the need 
for a high level of maintainability and operational ease, and the de- 
mands for economy can be substantially aided by centralizing these 
functions. 

II. GENERAL ADMINISTRATION AND MAINTENANCE PLAN

The dependability, administration, and maintainability objectives,
when applied to stored program switching systems, define the need,
in computer terms, for an on-line, real-time, high availability machine.
To achieve this economically requires careful initial systems plan- 
ning in basic redundancy configurations, in man-machine interface ca- 
pabilities, and in hardware-software tradeoffs. Though difficult to allo- 
cate precisely, it is estimated that approximately two-thirds of the 
total program is dedicated to a system of maintenance and
administrative programs that are used to administer system redundancy,
control detection and diagnostic routines, make performance measurements,
and provide for communication with the craftsman. It is the need to
keep the switching system operational during periods of growth and 
change of customer services (major hardware, program and transla- 
tion additions and changes), the need to keep calls being processed 
during switches to standby equipment (preservation of transient 
data), and the requirement of providing simultaneous on-line com- 
munication with a number of craftsmen, that adds extensively to the 
program structure and makes maintenance more than simply a matter 
of diagnostics. 

2.1 Basic Redundancy Plan 

As described in Ref. 1, the entire control unit (program control, 
program store, call store, input-output and peripheral bus system) is 
considered as a single entity which is duplicated. Two control units,
plus the maintenance center, comprise a control complex. Since the
component count and failure rates are sufficiently low, no
reconfiguration within a control unit is necessary, thus simplifying
both the hardware connections and controlling programs. The maintenance
center is unduplicated except for the existence of multiple teletypewriter
channels (eight maximum) which serve a variety of different functions.
In the peripheral equipment, network and scanner controllers
and supplementary central pulse distributors are duplicated since
failure here can affect large numbers of lines. Network fabric, trunks,
junctors, and service circuits are traffic-engineered items and thus 
contain inherent redundancy. 

In order to provide rapid recovery from troubles and effective con- 
tinuity of service, the processors normally are run in synchronism. 
This enables the off-line processor to keep its registers and call store 
contents continuously up to date and thus constantly available to take 
over on-line functions. This synchronous mode of operation involves 
providing all inputs (both normal scanning and peripheral mainte- 
nance responses) to both processors, while deriving outputs only from 
the on-line machine. Since network and scanner controllers do not 
have the long-term memory functions associated with processors, their 
duplicate mates are operated either in a traffic load shared state or 
in an idle stand-by condition. 
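
The synchronous arrangement can be pictured with a toy model: every input goes to both control units, so both call stores stay current, but only the on-line unit's orders reach the periphery. The sketch below is illustrative only; the class names, the `(line, state)` input tuple, and the order format are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ControlUnit:
    name: str
    call_store: dict = field(default_factory=dict)

    def process(self, scan_input):
        """Update the call store from one input and return a peripheral order."""
        line, state = scan_input
        self.call_store[line] = state
        return ("order", line, state)


class ControlComplex:
    def __init__(self):
        self.online = ControlUnit("CU0")
        self.offline = ControlUnit("CU1")
        self.orders_sent = []  # (unit name, order) actually sent to the periphery

    def step(self, scan_input):
        # Both units see every input, so the off-line call store stays up to date.
        order = self.online.process(scan_input)
        self.offline.process(scan_input)
        # Only the on-line unit's output is used.
        self.orders_sent.append((self.online.name, order))

    def switch(self):
        """Exchange the on-line and off-line roles."""
        self.online, self.offline = self.offline, self.online


cc = ControlComplex()
cc.step((12, "off-hook"))
cc.switch()                 # the former off-line unit takes over with current data
cc.step((12, "dialing"))
```

Because the off-line unit has processed the same inputs, the switch requires no copying of call store contents.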

2.2 Man-Machine Interface 

The major communication between the switching machines and the 
craftsman is by teletypewriter. In addition, audible alarms and visual 
displays are used to alert the craftsman to trouble conditions which are 
subsequently more fully reported on a teletypewriter channel. Manual 
controls are available on the maintenance center for performing special 
tests on out-of-service equipment and for taking restart action when 
the system has lost its "sanity" to the point where it can no longer 
interpret teletypewriter input commands. 

2.2.1 Teletypewriter Facilities 

The system program is organized and the hardware is arranged for 
a maximum of eight teletypewriter channels. Each of these channels 
can be programmed to produce only certain classes of messages and to 
accept only a limited class of requests. Any one of these channels can 
operate a remote teletypewriter by the use of a data set. A typical
installation might include four teletypewriters:

(i) Channel 1: local maintenance

(ii) Channel 2: remote maintenance

These channels report all system maintenance activity
(troubles detected, diagnosis results, plant registers, and the like)
and accept all system input messages (maintenance and other).

(iii) Channel 3: service order

This channel operates a remote teletypewriter at the plant
service center which is used to input changes in customer line
information (class of service, features, directory and billing numbers,
and so on).

(iv) Channel 4: traffic

This channel provides traffic data according to a defined
schedule (items such as trunk group usage, call rate, and dial tone
delay). Specific data can be requested and the schedule can be
changed by input messages on this channel.

All channels have built-in maintenance checks on each message 
and each one is arranged to provide automatic message transfer 
to an alternate channel in case of failure. Alternate backup channel 
definitions can also be changed easily. 
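
The channel arrangement just described behaves like a small routing table: each channel prints only certain classes of message and hands its traffic to an alternate when it fails. The following is a minimal sketch of that idea. The class names, the message class labels, and in particular the choice of alternates are hypothetical; the paper says only that alternates exist and can be changed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Channel:
    number: int
    output_classes: frozenset        # message classes this channel prints
    alternate: Optional[int] = None  # backup channel number (hypothetical values)
    in_service: bool = True


def route(channels, message_class):
    """Return the channel numbers that should print a message of this class."""
    targets = []
    for ch in channels.values():
        if message_class not in ch.output_classes:
            continue
        if ch.in_service:
            targets.append(ch.number)
        elif ch.alternate is not None and channels[ch.alternate].in_service:
            targets.append(ch.alternate)  # automatic transfer to the alternate
    return list(dict.fromkeys(targets))   # drop duplicates, keep order


channels = {
    1: Channel(1, frozenset({"maintenance"}), alternate=2),
    2: Channel(2, frozenset({"maintenance"}), alternate=1),
    3: Channel(3, frozenset({"service_order"}), alternate=1),
    4: Channel(4, frozenset({"traffic"}), alternate=1),
}

normal_traffic = route(channels, "traffic")
channels[4].in_service = False
failed_traffic = route(channels, "traffic")
```

Keeping the alternate as data rather than logic mirrors the statement that backup definitions can be changed easily.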

2.2.2 Display and Manual Controls 

In addition to teletypewriter information, the maintenance center 
has quick reference visual displays of items such as system alarm 
status, processor on-line/off-line status, and processor, trunk, and
peripheral unit trouble indications. It also has a dynamic display of the
characteristic program loop (program address register). 

Manual controls available to the craftsman can be divided into 
three basic categories: 

(i) The most frequently used in normal operation are associated 
with a trunk test panel facility. Trunk testing arrangements provide 
for switched access to trunks and for measuring dc and ac signaling, 
transmission, and noise parameters. To use them, a craftsman activates 
special panel keys, a Touch-Tone® telephone key set, and a teletype- 
writer. 

(ii) Facilities on the maintenance panel provide various tests on 
both the on-line and the off-line processors. Available off-line facili- 
ties include the ability to step through programs or whole routines, 
to interrogate and load registers and call store locations, and to 
preset condition (address) traps by use of the comparator. On-line 
functions include dynamic visual register and store displays, and
preset condition program interrupts which can provide snapshot
teletypewriter dumps. In addition, controls are provided to vary
margins (threshold) in both off-line call and program stores.

(iii) Manual restart controls are provided as a final backup when
more normal communications fail. Included are such items as forcing
and locking either control unit on line, and initiating memory
(call store) clearing operations to restart the system when it loses
"sanity."

2.2.3 Documentation 

The use of all these man-machine facilities is built on a hierarchy of 
documents with which the craftsman must be familiar. 

The first is the Input Message/Output Message Manual, which
defines all possible teletype messages which are programmed into the
machine and lists all acceptable input requests and the expected sys- 
tem response to them. Figure 1 shows a typical input message entry. 
This one is used to update the system calendar and each variable 
field is fully defined in the manual. 



1. INPUT MESSAGE FORMAT

UB SY:DAT:mon day year b!

2. EXPLANATION OF MESSAGE

Used to enter the current date into the system.

UB   = Utility Base level request. The request will be
       performed immediately.
SY   = SYstem
DAT  = DATe
mon  = month of the year (1-12 decimal)
day  = day of the month (1-31 decimal)
year = year, decimal, last two digits
b    = day of the week, decimal 0-6. Sunday is ...

3. SYSTEM RESPONSES

OK = message was OK. It was accepted and the work requested
     has been accomplished.
     If the message came from paper tape, the tape re... be turned on.
NG = the message was not accepted ... action or data fields were ...
     system state may not ... procedures ...

Fig. 1 — Sample Input Message Manual description: system calendar update. 
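
As an illustration of how an input message of this kind might be checked before being accepted, the sketch below validates the calendar update against the field ranges listed in Fig. 1 and answers OK or NG. It is not the No. 2 ESS implementation; the regular expression assumes the delimiters appear exactly as printed in the figure, and the function name and returned dictionary are hypothetical.

```python
import re

# Assumption: fields are separated exactly as printed in Fig. 1,
# "UB SY:DAT:mon day year b!", with "!" ending the message.
DATE_MSG = re.compile(r"^UB SY:DAT:(\d{1,2}) (\d{1,2}) (\d{2}) (\d)!$")


def parse_date_message(text):
    """Return ("OK", fields) for a valid date message, otherwise ("NG", None)."""
    m = DATE_MSG.match(text.strip())
    if not m:
        return "NG", None
    mon, day, year, weekday = (int(g) for g in m.groups())
    # Range checks taken from the field definitions in the manual entry.
    if not (1 <= mon <= 12 and 1 <= day <= 31 and 0 <= weekday <= 6):
        return "NG", None
    return "OK", {"mon": mon, "day": day, "year": year, "weekday": weekday}


good = parse_date_message("UB SY:DAT:12 25 69 4!")
bad = parse_date_message("UB SY:DAT:13 25 69 4!")   # month out of range
```

The sketch checks only the ranges the manual states; it does not check that the day of the week agrees with the date.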
When the output message gives specific diagnostic data, this Manual
points to a Trouble Locating Manual (see Fig. 2) which translates
the data field of the message (Trouble No. column) into a specific 
set of suspect circuit packs. Brief remarks are incorporated to describe 
the trouble area functionally. Should this information prove
inadequate (that is, replacement of packs does not clear the trouble),
subsequent sections of the Trouble Locating Manual are arranged to give
a detailed functional description of the test and an interpretation of 
the digits in the trouble number. Repair procedures might then in- 
volve reference to the more basic maintenance documents, including 
program listings and schematic drawings. Experience indicates that 
better than two-thirds of the troubles should be cleared through use 
of only the simple trouble number translation. 

A series of Bell System Practices are provided as basic training 
documents and give overall system descriptions in addition to de- 
tailing all operational and administrative procedures the craftsman 
must perform. These documents contain extensive cross references. 



[Figure: page from the Trouble Locating Manual (Bell Telephone
Laboratories, Incorporated). Columns: TROUBLE NO. and EQUIP LOC - TYPE.
The entries for trouble numbers 3812 000001 through 3812 000003 list
suspect circuit packs by equipment location and type (for example,
10-030-16 A103), together with a brief functional remark describing
the failure.]

Fig. 2 — Sample Trouble Locating Manual. 
2.3 Programming Organization

In order to control all the maintenance, communication, and admin- 
istration functions described above, the program structure is organized 
into a hierarchy of tasks performed at base level times and interrupt 
times. In addition, an initialization, or restart, procedure is provided 
under certain circumstances, resulting in a break in the continuity of 
program flow. 

2.3.1 Base Level Programs 

All deferrable, or low priority, maintenance tasks are handled at
the end of the normal call processing transient call record scan (Ref. 2).
Items covered here regularly include processing the waiting list of
incoming and outgoing teletypewriter messages, including such
functions as timing, format translation, and distribution of messages to
client programs.

The base level maintenance monitor program determines which 
additional tasks are to be performed. The normal sequence of these 
tasks may be modified by any maintenance activity that has taken 
place since the last transient call register scan. For example, if a
check circuit output has automatically switched out and inhibited a
processor, this fact is taken into account here and diagnostics on
the off-line processor take precedence over lesser periodic routines.
Similarly, manual requests may have a higher priority. One or more of
the following functions is performed at this time:

(i) Execution of any manual (teletypewriter-inserted) requests 
such as demand tests or make-busy functions. 

(ii) Updating of off-line call stores after an interval of non- 
synchronized processor operation. 

(iii) Diagnostics on the processor and on peripheral units as a
result of calls from trouble recovery routines or manual requests.

(iv) Short-term periodic routines performed on a schedule, in- 
cluding items such as processor trouble detection programs and 
initiation of trunk tests. 

(v) Long-term periodic routines intended to exercise those circuits
not used in normal operation, primarily check circuits and
on-line/off-line switching facilities.

(vi) Miscellaneous functions, including error count tabulations in 
the call store (such as plant registers), directed scans of various 
ferrods assigned to maintenance functions, and control of maintenance 
center displays. 
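
The base level monitor's job of choosing among the task classes listed above can be sketched as a priority queue consulted after each transient call record scan. This is a simplified illustration: it treats the list order (i) through (vi) as a strict priority order, which is an assumption, and the class name, task labels, and per-scan `budget` parameter are hypothetical.

```python
import heapq

# Assumption: the list order (i)-(vi) is used as a strict priority order.
PRIORITY = {
    "manual_request": 0,
    "call_store_update": 1,
    "diagnostic": 2,
    "short_periodic": 3,
    "long_periodic": 4,
    "miscellaneous": 5,
}


class BaseLevelMonitor:
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker: first in, first out within one class

    def request(self, kind, description):
        heapq.heappush(self._queue, (PRIORITY[kind], self._seq, description))
        self._seq += 1

    def run_after_scan(self, budget=1):
        """Run up to `budget` pending tasks, highest priority first."""
        done = []
        while self._queue and len(done) < budget:
            _, _, description = heapq.heappop(self._queue)
            done.append(description)
        return done


monitor = BaseLevelMonitor()
monitor.request("short_periodic", "trunk test")
monitor.request("diagnostic", "diagnose off-line control unit")
monitor.request("manual_request", "make-busy trunk 7")
first_pass = monitor.run_after_scan(budget=2)
second_pass = monitor.run_after_scan()
```

Limiting the work done per pass reflects the fact that these tasks are deferrable and must not delay the next call processing scan.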
2.3.2 Input-Output Interrupt Level Programs

In addition to the scanning, digit receiving, and outpulsing func- 
tions that call processing handles during the periodic 25-millisecond 
input-output interrupt routine, maintenance functions requiring close 
timing are also executed here. Items covered here include: 

(i) Trunk and service circuit tests requiring precisely timed actions 
are executed first to avoid stagger resulting from variable execution 
time in the various parts of the interrupt program. 

(ii) The network controller's maintenance ferrods are checked for
proper operation based on the previous interrupt's actions. Failures
result in peripheral order buffer retries and eventual call teardown
if no working mode is found (Ref. 2). The base level routine is notified
of the troubles in order that diagnosis may be initiated later.

(iii) All teletypewriters are scanned for new inputs and new
characters are output to active teletypewriters.

2.3.3 Maintenance Interrupt Level Program 

The maintenance interrupt has the highest priority. It is initiated 
by processor mismatches as detected by the call store comparison in 
the maintenance center, by some peripheral unit and input-output 
errors, and by manual request. All three sources come in at the 
same priority level and block each other until their respective tasks 
are complete. These error signal interrupts immediately initiate 
trouble recovery programs and after the appropriate recovery actions 
(for example, switch of on-line/off-line control unit configuration or 
scanner controller) the problem is passed on for further resolution 
to the lower priority base level programs. 

2.3.4 Initialization Restart 

If the processors switch their on-line/off-line configuration while the
off-line is out of synchronism, or if they go into a multiple switching
mode from which they cannot recover, the program is restarted from
a fixed location to provide an orderly return to the beginning of the
call processing monitor cycle. The initial source of the trouble may
be either hardware or software difficulties. A count of the number of
restarts incurred during a given time is used to progressively clear
out the call store until the system recovers its "sanity." This strategy
involves clearing or releasing various transient and maintenance-only
locations of call memory while preserving most stable talking path
records. An initialization restart of this type may also be manually
initiated, including a complete call clearing capability when necessary.
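
The progressive clearing strategy can be illustrated as an escalation driven by the number of recent restarts. The sketch below is only a model of the idea: the three clearing levels, the 60-second window, and the class name are hypothetical, since the paper states only that the restart count within a given time selects how much of the call store is cleared.

```python
from collections import deque

# Hypothetical escalation levels, mildest first.
LEVELS = [
    "clear maintenance-only locations",
    "clear transient call records",
    "clear all call store (all calls lost)",
]


class RestartEscalator:
    def __init__(self, window_seconds=60.0):
        self.window = window_seconds   # assumed length of the counting interval
        self.history = deque()         # times of recent restarts

    def on_restart(self, now):
        """Record a restart at time `now` and return the clearing action to take."""
        self.history.append(now)
        # Forget restarts that fall outside the counting window.
        while self.history and now - self.history[0] > self.window:
            self.history.popleft()
        level = min(len(self.history), len(LEVELS)) - 1
        return LEVELS[level]


esc = RestartEscalator(window_seconds=60.0)
actions = [esc.on_restart(t) for t in (0, 10, 20, 30)]
later = esc.on_restart(500)   # an isolated restart long afterwards
```

An isolated restart costs little, while repeated restarts in quick succession escalate toward the complete clearing that is otherwise available only by manual request.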

III. ADMINISTRATIVE FUNCTIONS 

A particular telephone office is defined to the call processing
programs by a series of parameters and translation tables in program
store that describe that office's network traffic characteristics, trunk- 
ing facilities, routing and charging constraints and all individual sub- 
scriber definitions. The initial office traffic and trunking engineering 
is performed as a result of the operating company's analysis of needs 
and results in ordering the proper equipment frames for initial instal- 
lation. All program store translation and parameter contents for initial 
service are processed by means of an Office Data Assembler program 
in a general purpose regional computation center. 

As time passes, the typical office evolves and grows, and individual 
subscriber definitions change as people move and new services are 
offered. In order to respond to these changes, a series of administrative 
(recent change) programs are resident in the No. 2 ESS machine to 
permit a virtually continuous memory updating capability. Subscriber 
changes are generally originated by way of the plant service center, 
while network and trunking modifications are based largely on the 
traffic measurements performed by the processor on its daily calling 
rate pattern, and on operating company projections. In the case of 
major equipment growth additions, translations are changed by means 
of a new Office Data Assembler run. 

In addition, another class of administrative functions known as
plant measurements is maintained for each office. These involve
both service measurements to reflect actual effects on service as seen 
by the customer (for example, total customer receiver time-outs) and 
performance measurements to reflect the basic health of the machine 
in terms of such items as failure and error counts of various pieces of 
equipment. These measurements are useful in directing attention to 
areas where additional maintenance effort appears justified. 

3.1 Traffic Measurements 

Traffic measurements are made throughout all phases of the call 
processing programs and are recorded in call stores. Data are put 
out through a dedicated teletype channel on assigned quarterly, 
hourly, daily, and weekly schedules, or on demand. Various
combinations of the three basic types of measurements (peg counts, usage,
and overflow) are performed in such areas as networks, junctors,
service circuits, trunks, and office calls. Their usefulness can perhaps 
best be described by some examples. 

In the network, usage counts are maintained for each concentrator 
for use in load balancing and line assignment. In the junctor area, 
usage counts are kept on wire and circuit junctor groups for load
balancing between networks and for intraoffice-interoffice call rate 
measurements. Trunk measurements are made for each group with 
various combinations of peg count, usage, and overflow in outgoing, 
incoming, and two-way categories. Usage counts are also made on 
subscriber items such as the various custom calling services. 
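
The three basic measurement types can be made concrete with a small model of one trunk group. A peg count records each attempt offered to the group, overflow records attempts that find every trunk busy, and usage is accumulated by sampling the number of busy trunks at periodic scans. The sketch is illustrative; the class name and the scan-based usage accumulation are assumptions, not a description of the No. 2 ESS call store layout.

```python
class TrunkGroupMeter:
    def __init__(self, size):
        self.size = size      # number of trunks in the group
        self.busy = 0
        self.peg_count = 0    # attempts offered to the group
        self.overflow = 0     # attempts that found every trunk busy
        self.usage = 0        # sum of busy trunks seen at each periodic scan

    def seize(self):
        """Offer one call to the group; return True if a trunk was seized."""
        self.peg_count += 1
        if self.busy == self.size:
            self.overflow += 1
            return False
        self.busy += 1
        return True

    def release(self):
        if self.busy:
            self.busy -= 1

    def scan(self):
        """Periodic sample of occupancy, accumulated as usage."""
        self.usage += self.busy


group = TrunkGroupMeter(size=2)
group.seize()
group.seize()
group.seize()    # third attempt overflows: both trunks are busy
group.scan()     # two trunks busy at this scan
group.release()
group.scan()     # one trunk busy at this scan
```

A high ratio of overflow to peg count on a group points to the kind of trunk addition that the recent change procedures are meant to handle.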

When these data indicate the need for relatively minor reconfigura- 
tions without major hardware additions, translations are changed 
by local recent change procedures in the No. 2 ESS machine. In 
other cases, more elaborate processing is required at a regional 
computation center. 

3.2 Recent Change Procedures 

The types of items that fall in the recent change category include 
service orders (subscriber additions and changes), trunk additions, 
service observing, addition of new routes, and changes in office code 
treatment. The service orders are usually remotely entered from a 
plant service center; the other items are performed locally by the 
craftsman from the maintenance center teletypewriter. Again, an ex- 
ample will perhaps best describe the process. 

The original office record input forms (Fig. 3) indicate that direc- 
tory number 736-0056 is vacant. A new line is to be added. The in- 
formation to be inserted includes this chosen vacant code together 
with its assigned terminal equipment number, associated billing num- 
ber, desired features (dial transfer and add-on conference) and line 
class code (1FR single party, flat rate, residence). This information is 
keyed in on the service order teletypewriter channel as shown in Fig. 4 
in the standard universal service order code input language. This 
information is processed by the No. 2 ESS, checked for validity, con- 
verted into its binary equivalent, and stored in a special call store 
recent change buffer. 

ft RC SO/       <-- ADMINISTRATION RECENT CHANGE SERVICE ORDER.
TYP NEW/        <-- TYPE NEW LINE.
TN 736 0056/    <-- TELEPHONE NUMBER.
TEN 022200/     <-- TERMINAL EQUIP NUMBER (NW. GROUP, CONC., SW., LEVEL)
BTN 736 0050/   <-- BILLING TO NUMBER.
FEA DTR ADD/    <-- FEATURES: ADD-DIAL TRANSFER
FEA AD3 ADD/    <-- FEATURES: ADD-ADD ON CONFERENCE
LCC 1FR! OK     <-- LINE CLASS CODE - SINGLE PARTY, FLAT RATE, RESIDENCE

Fig. 4 - Sample recent change teletypewriter service order input message.

The call store buffer, with a capacity of 512 updated program store
words per module of permanent magnet twistor translation storage
(16,384 words), is searched by the call programs for changes each
time its associated permanent magnet twistor translations are
accessed. As changes and additions accumulate in the buffer, an output
indication is given on the teletypewriter that the maximum capacity is
being approached and that action should be taken to update the
permanent magnet twistor magnet cards. This is accomplished by means of
the single card writer contained in the maintenance center frame (Ref. 1).

The contents of the recent change call store buffer are analyzed by 
program and translated into the affected set of permanent magnet 
twistor cards (128 words per card) which are identified in a teletype- 
writer message upon request. These selected cards are individually in- 
serted in the single card writer and the entire card is magnetized by a 
program that copies the old on-line translation plane from permanent 
magnet twistor with all appropriate call store buffer recent change 
entries incorporated. When all affected cards have been magnetized, 
they are inserted in the off-line program store, and automatically 
verified against on-line program store plus call store change buffers. 
If successful, the on-line/off-line processors are switched, the pro- 
cedure is repeated, and the call store buffers are cleared to allow for 
accumulation of the next group of changes. 
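
The recent change mechanism amounts to an overlay: reads consult the call store buffer first and fall back to the permanent twistor contents, and the set of cards to be rewritten follows from the buffered addresses. The sketch below illustrates this under stated assumptions. The 512-word capacity and 128 words per card come from the text; the contiguous address-to-card mapping, the 90 percent warning threshold, and all names are hypothetical.

```python
WORDS_PER_CARD = 128      # from the text
BUFFER_CAPACITY = 512     # updated words per twistor module, from the text
WARN_FRACTION = 0.9       # hypothetical threshold for the teletypewriter warning


class RecentChangeBuffer:
    def __init__(self, twistor):
        self.twistor = twistor   # stands in for one module of permanent storage
        self.changes = {}        # address -> updated word, held in call store

    def write(self, address, word):
        """Buffer a change; return True when the craftsman should be warned."""
        if address not in self.changes and len(self.changes) >= BUFFER_CAPACITY:
            raise OverflowError("recent change buffer full")
        self.changes[address] = word
        return len(self.changes) >= WARN_FRACTION * BUFFER_CAPACITY

    def read(self, address):
        # A buffered change overrides the permanent twistor word.
        return self.changes.get(address, self.twistor[address])

    def affected_cards(self):
        # Assumption: cards hold consecutive blocks of 128 addresses.
        return sorted({addr // WORDS_PER_CARD for addr in self.changes})

    def commit(self):
        """Model the card rewrite: fold changes into storage and clear the buffer."""
        for addr, word in self.changes.items():
            self.twistor[addr] = word
        self.changes.clear()


buf = RecentChangeBuffer([0] * 16384)
buf.write(130, 7)
buf.write(5, 9)
cards = buf.affected_cards()
before = (buf.read(130), buf.read(131))
buf.commit()
```

In the real procedure the commit step is the manual card writing, verification against the on-line store, and processor switch described above; here it is collapsed into a single call.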

Depending on office size, rate of change, and local office manning 
practices, this translation updating procedure may be performed
perhaps every two weeks. In larger offices, more automatic translation
updating procedures would be provided if the change rate warranted 
it. Throughout the process, office record forms must be kept up to date 
in order that future changes do not create conflicts. 

3.3 Major Translation Change Procedures 

Certain translation changes require a simultaneous change of large 
blocks of data, plus extensive validity tests and error checking on the 
input data. In some cases, even with reasonable amounts of data, the 
translator structure is sufficiently complex to make the relatively 
simple recent change program inadequate. The changes which fall in 
these categories are situations which require major office equipment 
growth (network frames, trunk frames, storage additions, and so on),
major revisions in routing, screening, and charging translations, and
general reorganization of existing translator origins as additional 
storage is added. 

In all of these cases, the problem can usually be anticipated well 
ahead of time and so the rapid response characteristic of the recent 
change procedure is not essential. As a result, use is again made of the 
Office Data Assembler program as shown in Fig. 5. Used here in its 
update mode, the Office Data Assembler program accepts new punched
input forms, validates and error checks them, and incorporates the 
data into the existing office translators. To ensure that the actual No. 
2 ESS office translator being updated is consistent with the administra- 
tive forms, the Office Data Assembler operates on an actual dump of 
the program store contents, and is capable of providing a new set of 
administration records.

IV. CONTROL COMPLEX MAINTENANCE 

The control complex, as stated earlier, consists of two control units
and a maintenance center. Only two configurations of control complex
equipment are possible: either control unit may be on-line
(controlling peripheral equipment) while the other is off-line. Normally,
the two control units are running in command synchronism with only
the on-line control unit actually performing a controlling function.

Problems occur in control complex operation when either a solid 
or marginal circuit fault occurs. Both maintenance circuits and pro- 
grams are used to detect a trouble condition, to recover a working 
system, and finally to either determine that a transient error occurred 
or to diagnose what circuit fault exists. Figure 6 summarizes how 
troubles are detected and what happens after a trouble is found. 

Maintenance circuits and programs can handle a wide variety of 
different faults. For example, programs and circuits used in fault 
diagnosis are designed to handle faults commonly encountered in the 
high-speed transistor-resistor logic used in No. 2 ESS. These faults
include: (i) open and shorted semiconductors, (ii) open resistors, and
(iii) open connector contacts on pluggable circuit packages.

Programs and circuits which recover a working system after a fault
occurs can handle not only these faults but also faults which cause
marginal circuit operation or faults such as wire opens or wire crosses
which should not occur frequently.

4.1 Trouble Detection Methods 

A combination of check circuits and program tests is used to find
circuit troubles, as shown in Fig. 6.

4.1.1 Check Circuits 

Typical internal control unit check circuits detect problems such 
as store access faults and automatically attempt to recover a working 
system by switching control units if a fault is detected in the on-line 
control unit. Check circuits are used when program tests cannot detect 
a fault fast enough or reliably enough. For example, the call store 
access check circuit detects faults such as shorted access diodes which 
cause marginal store operation. This class of faults cannot be solidly 
tested by any type of program check. 

A maintenance center check circuit compares the call store input 
registers in the two control units when call store operations are per- 
formed in synchronism. A fault or transient error in almost any part 
of either control unit quickly results in a call store input register 
mismatch since almost all tasks performed in both the program control 
and input-output involve call store writing. A check circuit mismatch 
signal interrupts the program currently being run and causes a trouble 
recovery program to be called in which attempts to find the faulty 
control unit and place it in the off-line mode. This trouble recovery 
program is described in Section 4.2. 

Each program control contains a program timer circuit which is
designed to back up other detection methods. Normally, an on-line
control unit program zeroes a counter in both program timers at least
once every 300 milliseconds. If, however, a trouble condition such as
a program loop prevents a timer from being zeroed (within 320
milliseconds for the on-line and 640 milliseconds for the off-line),
the timer will time out and automatically produce a switch.
The new on-line control unit is automatically forced to run an initial- 
ization restart program which attempts to establish a working system. 
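
The timer behavior described above can be sketched in software. This is
only an illustrative model, not the actual circuit: the class name, the
10 millisecond simulation step, and the choice to fire when the elapsed
time reaches the limit are all assumptions. Only the 300, 320, and 640
millisecond figures come from the text.

```python
from dataclasses import dataclass

ONLINE_TIMEOUT_MS = 320   # from the text: limit for the on-line timer
OFFLINE_TIMEOUT_MS = 640  # from the text: limit for the off-line timer


@dataclass
class ProgramTimer:
    """Hypothetical software model of one program timer circuit."""
    timeout_ms: int
    elapsed_ms: int = 0

    def zero(self) -> None:
        # A healthy on-line program does this at least every 300 ms.
        self.elapsed_ms = 0

    def tick(self, ms: int) -> bool:
        """Advance time; return True once the timer has timed out."""
        self.elapsed_ms += ms
        return self.elapsed_ms >= self.timeout_ms


def run(zero_times_ms, total_ms, timeout_ms, step_ms=10):
    """Return the time (ms) at which the timer fires, or None if it never does."""
    timer = ProgramTimer(timeout_ms)
    zero_at = set(zero_times_ms)
    for now in range(step_ms, total_ms + 1, step_ms):
        if timer.tick(step_ms):
            return now            # a time-out would force a control unit switch
        if now in zero_at:
            timer.zero()
    return None
```

In this model a program that keeps zeroing the timer every 300
milliseconds never causes a time-out, while a program that stops
zeroing (for example, one caught in a loop) causes one 320 or 640
milliseconds after the last zeroing.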

4.1.2 Program Detection 

Short-term periodic program tests detect the same troubles found 
by the mismatch trouble recovery program since exactly the same 
program tests are run. These tests are continually run interleaved with
call processing, and detect faults within approximately five seconds,
compared with microseconds in the case of the comparator circuit.
These tests provide trouble detection even when the control units 
are not being matched and provide a backup to the comparator check 
circuit. In fact, if very rapid fault detection were not required, it 
would not be necessary to have a comparator circuit. 

Detection program testing of the input-output presents a special 
problem since this unit is normally operating independently of the 
program control. The program control has to stop the input-output, 
save, test, and restore input-output registers, and restart the input-out- 
put each time an input-output detection test is run in order to prevent 
interference with normal input-output operation. 

Long-term periodic exercise programs perform tests on circuitry 
not normally checked by other detection means. For example, both 
the off-line program and call stores are placed in marginal modes and 
tested for correct operation. This is accomplished by applying high 
and low values of threshold voltage to store readout amplifiers at 
the same time words are read out and checked for correctness. This 
check attempts to force store problems to show up before they can 
affect actual system operation. In addition to store margin tests, the 
complete diagnostic test sequence as well as a test of control unit 
switching is performed to force troubles to show up in circuits not 
exercised in normal system operation. 

An additional periodic check on correct system operation is per- 
formed by the base level maintenance monitor which checks the 
system state once each program scan by looking at certain key flip- 
flops. If an abnormal state is found, a trouble has occurred and diag- 
nostics are called in. For example, diagnostics are called in if the 
off-line control unit is found inhibited and not running programs when 
it is supposed to be running in synchronism with the on-line control 
unit. 

4.2 Trouble Recovery 

After a trouble is detected, automatic circuits and, in some cases, 
trouble recovery programs are used to obtain a working system. The 
physical action taken to restore a working system is very simple as 
a result of the simple split redundancy. If a fault is found in the 
on-line control unit, control units are switched and the new off-line 
control unit is inhibited from running programs. If, on the other 
hand, a fault is found in the off-line control unit, the only action
taken is to inhibit the off-line control unit. These actions are
automatically initiated when faults are detected by control unit check
circuits or by programs. After trouble recovery action, the base level 
maintenance monitor will automatically call in diagnostics when the 
off-line control unit is found inhibited. 

4.2.1 Mismatch Detection Programs 

A special mismatch trouble recovery program is run in the on-line 
control unit after a mismatch to determine if the on-line control 
unit contains a fault. This program first inhibits the off-line control 
unit and then calls in detection tests. These detection tests are identi- 
cal to those run during short-term periodic detection. However, all the 
tests are run at once instead of being interleaved with call processing 
in order to minimize test time. Tests are run in a sequence which at- 
tempts to test as much circuitry as possible as quickly as possible (all 
tests are run within 100 milliseconds). As a result, functions tested 
first are those which exercise the most circuitry. Actually, a large 
amount of circuitry can be assumed to be good at the beginning of the 
mismatch detection test, since many failures are detected by internal 
control unit check circuits rather than by the maintenance center 
match circuit. 

If a solid fault in the on-line control unit is detected by a mismatch
detection program, a control unit switch is automatically generated.
The switch inhibits the new off-line control unit and uninhibits the
new on-line control unit, causing an initialization restart program to
be run in it.
The new on-line control unit will be slightly out of date since no 
new inputs were recorded in its call store while mismatch detection 
programs were being run; however, it will not contain erroneously 
processed information. Shortly after the switch, the base level main- 
tenance monitor will call in diagnostics when it finds the off-line 
control unit inhibited. 

If the on-line control unit successfully passes all detection tests, 
diagnostic tests of the off-line control unit are called in. These tests 
attempt to determine if an off-line control unit fault caused the mis- 
match. Diagnostic test failure results in formation of a teletype 
printout which attempts to pinpoint the location of the fault. Success- 
ful completion of all diagnostic tests indicates a transient error 
condition caused the mismatch. If this is the case, control units are 
placed back in synchronism and one is added to a call store word
that counts all-tests-pass conditions.
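
The three outcomes of mismatch recovery described in this section can
be summarized as a small decision function. This is a sketch under
stated assumptions: the function name, the two callables standing in
for the detection and diagnostic tests, and the counter dictionary are
hypothetical and are not taken from the No. 2 ESS programs.

```python
from enum import Enum


class Outcome(Enum):
    SWITCHED = "on-line fault: switch control units, diagnose the old on-line unit"
    OFFLINE_FAULT = "off-line fault: leave it inhibited and print a diagnostic"
    TRANSIENT = "transient error: resynchronize and count an all-tests-pass"


def recover_from_mismatch(online_detection_passes, offline_diagnostics_pass, counters):
    """Choose the recovery action after a comparator mismatch.

    Both test arguments are hypothetical zero-argument callables that
    return True when the corresponding tests pass.
    """
    # The off-line unit is inhibited first; detection tests then run
    # back to back in the on-line unit.
    if not online_detection_passes():
        return Outcome.SWITCHED
    # The on-line unit looks good, so the off-line unit is diagnosed.
    if not offline_diagnostics_pass():
        return Outcome.OFFLINE_FAULT
    # Nothing failed: treat the mismatch as a transient error.
    counters["all_tests_pass"] = counters.get("all_tests_pass", 0) + 1
    return Outcome.TRANSIENT
```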

4.2.2 Peripheral Unit Trouble Recovery Programs

Certain faults in the input-output unit do not cause the control units
to mismatch but instead cause peripheral units to be accessed
improperly. Detection of these faults by certain internal input-output
unit check circuits causes peripheral unit trouble recovery programs to
be called in. These programs, described in detail in Section V, attempt
to recover a working system by retrying peripheral orders and, if
necessary, switching peripheral and control unit equipment.

4.3 Diagnostics 

Diagnostic programs are automatically called in by the base level
maintenance monitor after trouble detection and recovery have been
completed, or can be manually requested via the teletypewriter. The
objective of diagnostic programs is to produce a teletypewriter print- 
out which isolates a fault to as small an area as possible. The 
following paragraphs describe circuitry and programs which are used 
to achieve this. 

4.3.1 Maintenance Circuitry Provided for Diagnostic Testing

Special control unit circuitry is provided to allow diagnostic tests
to resolve the location of faults to a relatively small number of
circuit packs. External maintenance commands allow the on-line control
unit to control and monitor actions performed in the off-line unit. As
shown in Fig. 7, the contents of on-line control unit registers can be
gated to off-line registers or vice versa. Also, control functions such
as starting and stopping the off-line control unit can be performed.
This is accomplished by resetting (to start) or setting (to stop)
the off-line inhibit flip-flop. Figure 7 shows examples of two different
external commands. One command, executed in control unit 0, causes
a register in that unit to be gated to a register in control unit 1 via
the program gating busses. The other command starts control unit 1 by
zeroing the inhibit flip-flop.

In a typical diagnostic test, the on-line control unit uses external 
commands to test off-line circuitry as follows: 

(i) Stop the off-line control unit, initialize command timing, and
gate a program command directly into the program store output
register.

(ii) Execute the command present in the off-line program store
output register.

(iii) Look directly at an off-line register to see if the command was
executed correctly.

Fig. 7 — Typical diagnostic program commands.

This testing method allows off-line control unit circuitry to be 
tested by the on-line control unit without having to rely on the correct 
operation of a large amount of off-line circuitry, including the off-line 
program store. 
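
The three-step test method above can be illustrated with a toy model.
Everything here is hypothetical: the class, the single register "A",
the "LOAD" command, and the injected stuck-at-zero fault are
illustrative stand-ins and do not represent the actual No. 2 ESS
command set or register layout.

```python
class OffLineUnit:
    """Toy model of an off-line control unit (all names are illustrative)."""

    def __init__(self, stuck_at_zero_bit=None):
        self.inhibited = False
        self.ps_output_register = None      # program store output register
        self.registers = {"A": 0}
        self.stuck_at_zero_bit = stuck_at_zero_bit  # injected fault, if any

    def execute(self):
        op, reg, value = self.ps_output_register
        if op == "LOAD":
            if self.stuck_at_zero_bit is not None:
                value &= ~(1 << self.stuck_at_zero_bit)  # faulty bit never sets
            self.registers[reg] = value


def external_command_test(unit, pattern=0xFFFF):
    """Return a word whose 1 bits mark register bits that failed the test."""
    unit.inhibited = True                             # (i) stop the off-line unit
    unit.ps_output_register = ("LOAD", "A", pattern)  # (i) gate a command in directly
    unit.execute()                                    # (ii) execute that command
    observed = unit.registers["A"]                    # (iii) look at the register
    return observed ^ pattern
```

Because the command is gated straight into the output register, the
test in this sketch never reads the off-line program store, which
mirrors the independence the text describes.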

Certain internal maintenance commands executed by the off-line 
control unit are also used to aid diagnosis. These commands can either 
be executed under direct control of the on-line control unit as outlined 
above or as part of a program sequence controlled by off-line internal 
command logic (the command is read out of the off-line program 
store). In this second case, external commands are only used to 
establish an initial off-line program starting address, to start the 
program sequence running, and to look at test results after off-line 
testing is completed.

Internal maintenance commands are used to test areas such as the
input-output unit which are not accessible to normal internal
commands. For example, a typical input-output diagnostic test uses an
internal maintenance command to set a flip-flop which prevents clock
pulses from enabling input-output gates. This stops all input-output
operations and allows the program control to execute additional
internal maintenance commands to test specific input-output functions.

Commands which gate to and from maintenance center registers 
are also used by the on-line control unit to both control and monitor 
actions performed in the off-line unit. For example, outputs of off-line 
check circuits can be observed by looking at the maintenance center 
error register. 

4.3.2 Diagnostic Programs 

It is important to order diagnostic tests so that a command or 
circuit used to test another command or circuit has itself been pre- 
viously tested. If this rule is followed, the only circuitry under 
suspicion if a given test fails is the circuitry currently being tested 
and only a single printout indicating what test failed and how it failed 
is required to provide diagnostic information. Good test access from 
the on-line to off-line control unit allows tests to be ordered in this 
manner in most cases. 

Figure 8 is a simplified flowchart of the diagnostic sequence. First, 
on-line to off-line access is tested both via the maintenance center 
and external commands. Success of these test blocks ensures that
sufficient on-line to off-line access is available to test off-line circuitry 
in detail. The next test blocks use this access to test various parts of 
the off-line control unit including command logic, the program store, 
and the call store. Near the end of the diagnostic sequence, enough 
off-line circuitry has been tested so that diagnostic programs can be 
run entirely in the off-line control unit. Tests such as input-output 
diagnostics are run in this manner under control of the on-line control 
unit. In many cases, these same tests are also used in periodic and mis- 
match detection. 

The entire diagnostic sequence is always called whenever control
complex diagnostics are called in automatically. This ensures that
faults are caught by the proper diagnostic test and that a meaningful
printout is produced. No attempt is made to use the results of trouble
recovery programs in order to shorten the time required for diagnostic
testing since diagnostic test time (about 30 seconds) is an
insignificant part of the total time required to repair a fault.

Fig. 8 — The diagnostic sequence.

In certain cases, after a given diagnostic test failure, up to two 
other diagnostic tests further down in the diagnostic sequence may be 
automatically requested in order to improve resolution. In all cases, 
if more than one test fails, the diagnostic printout is generated by the 
failing test closest to the end of the diagnostic sequence. 
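
The rule for choosing which failing test generates the printout can be
sketched as follows. The test-block names in SEQUENCE are hypothetical
labels loosely based on the blocks mentioned in the text; the actual
block names and order are those of Fig. 8, which is not reproduced
here.

```python
# Hypothetical block names, listed in an assumed diagnostic order.
SEQUENCE = [
    "maintenance_center_access",
    "external_commands",
    "command_logic",
    "program_store",
    "call_store",
    "input_output",
]


def select_printout(failed_tests, sequence=SEQUENCE):
    """Return the failing test that lies closest to the end of the sequence."""
    position = {name: index for index, name in enumerate(sequence)}
    failing = [test for test in failed_tests if test in position]
    if not failing:
        return None  # all tests passed, so no diagnostic printout is formed
    return max(failing, key=position.__getitem__)
```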

As shown in Fig. 9, a test block is subdivided into test segments. 
Numbers associated with test blocks and test segments are used to 
uniquely identify the failing test in a teletypewriter printout. A typical 
diagnostic printout shown in Fig. 10 contains a block number 
and segment number identifying the test that was in progress 
when the printout was formed. It also contains a data word indicating 
exactly how the test failed; for example, what bit of a particular 
register is bad. This information is used by the craftsman in
referencing the translation section of the trouble locating manual in
order to find replacement circuit packages.

A failure in a circuit containing state memory may in the worst case
leave the circuit in any one of 2^N states if N memory elements are
present. If no attempt is made to initialize this state memory before a
diagnostic test is run, up to 2^N different diagnostic printouts may be
generated. In general, good on-line to off-line communication allows
state memory to be set to a single consistent state. This means only
one printout will be produced for a given fault independent of the con- 
trol unit state at the time of failure. For certain faults, it is not pos- 
sible to establish a single initial state before diagnostic testing is 
started. A fault of this type may produce a different printout for each 
possible initial state. In these cases, an attempt is made to list all 
possible printouts in the trouble locating manual. 
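
A small enumeration makes the 2^N argument concrete. This is a sketch
only: the fault model below, in which the printout simply reflects the
starting state, is a hypothetical worst case chosen for illustration
and is not drawn from the article.

```python
from itertools import product


def state_dependent_printout(state):
    """Hypothetical worst-case fault: the printout mirrors the starting state."""
    return sum(bit << i for i, bit in enumerate(state))


def distinct_printouts(fault_model, n_elements, initialize=None):
    """Collect the printouts seen over all 2**N possible starting states."""
    seen = set()
    for state in product((0, 1), repeat=n_elements):
        if initialize is not None:
            state = initialize(state)  # force state memory to a known value
        seen.add(fault_model(state))
    return seen
```

With N = 3 and no initialization the model yields 8 distinct printouts;
forcing every starting state to all zeros before the test reduces this
to one.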

Fig. 9 — A diagnostic test block.

Often, in order to obtain good diagnostic resolution, it is desirable
to perform combinational rather than sequential checks of logical
operation, since combinational checks depend only on the present
inputs and not on past machine states. However, much of the
control unit normally operates in a synchronous sequential manner:
operations performed during clocked intervals are dependent not only
on present inputs but also on past machine states. An operation
performed during such a clocked interval can be checked combinationally
if the state memory remembering the past machine states can be
gated to by program and if the machine can be prevented from
cycling through the sequence. Figure 11 is an example of circuitry in
the input-output unit which is checked in this manner. The following
steps are used to check gate setb.

(i) Set a control flip-flop to prevent clock pulses from enabling 
input-output logic gates or resetting flip-flops. 

(ii) Set the A flip-flop and clear the B flip-flop by direct gating via 
the program gating bus. 

(iii) Enable clock signal poon for one clock interval by executing
a special internal maintenance instruction.

(iv) Check to see if the B flip-flop is correctly set via program 
gating bus access. 

Failure of the B flip-flop to be set results in generation of a
diagnostic printout which gives the block and test segment number of
the failing test plus a test result data word.
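
The four steps above can be mimicked with a toy model of the circuit.
The logic assumed here, that gate setb sets flip-flop B when flip-flop
A is set and clock signal poon pulses, is a guess at the function shown
in Fig. 11, and the class and fault flag are hypothetical.

```python
class InputOutputLogic:
    """Toy model: gate setb sets B when A is set and poon pulses (assumed)."""

    def __init__(self, setb_open=False):
        self.a = 0
        self.b = 0
        self.clock_blocked = False
        self.setb_open = setb_open  # injected fault: gate output never asserts

    def pulse_poon(self, maintenance=False):
        # Normal clock pulses are ignored while the control flip-flop is set;
        # the maintenance instruction lets exactly one pulse through.
        if self.clock_blocked and not maintenance:
            return
        if self.a and not self.setb_open:
            self.b = 1


def check_setb(io):
    """Return True if gate setb passes the combinational check."""
    io.clock_blocked = True           # (i) block normal clock pulses
    io.a, io.b = 1, 0                 # (ii) set A and clear B by direct gating
    io.pulse_poon(maintenance=True)   # (iii) enable poon for one interval
    return io.b == 1                  # (iv) B should now be set
```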

Fig. 11 — Combinational testing.

The program timer, which is an asynchronous sequential circuit, is
also checked in a similar manner except that, instead of stopping the
clock, leads from a control flip-flop are used to cut feedback paths
and prevent sequential operation while testing is being performed.

4.4 Error Strategy 

A control complex error is defined as a trouble which is detected by
either circuit or program means and then disappears when diagnostic
testing is performed. Errors can be produced by such things as a fault
causing a marginal circuit condition or by noise which changes a 1 to
a 0 or vice versa. Errors had to be considered when designing control
complex maintenance programs in order to prevent them from adversely
affecting call processing and to ensure that a repeated error
caused by a marginal circuit fault results in some attempt at
corrective action.

4.4.1 Mismatch Detection 

Mismatch detection tests attempt to minimize the effect of errors or 
standby faults on call processing by turning on input-output work 
(both circuit and program) very soon after the initial mismatch. 
Figure 12 shows detection test ordering and actions taken at various 
times after the initial mismatch. Note that the off-line control unit 
is inhibited and a test of the input-output is performed immediately
after mismatch. If this test is successful, input-output digit scan
functions are resumed. After an additional short test of program
control logic (approximately 5 milliseconds after mismatch), the
input-output 25 millisecond interrupt is allowed to resume.

Allowing the input-output unit and the 25 millisecond interrupt
program to resume operation very soon after the mismatch prevents
incoming information from being lost when no solid fault is present in
the on-line control unit.

4.4.2 Repeated Errors 

Fig. 12 — Mismatch trouble recovery.

The base level maintenance monitor keeps a count of the number of
times automatically initiated control complex diagnostics return an
all-tests-pass indication. If this count exceeds a fixed threshold in a
ten-minute interval, further tests are initiated and no further
automatic attempt is made to return to synchronization. Thus, some
protection is taken against transient and intermittent faults
overloading the system.
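
The repeated-error rule can be sketched as a counter over a ten-minute
window. The article does not give the threshold value or say whether
the interval is fixed or sliding, so the threshold of 3 and the sliding
window used here are assumptions, as are the class and method names.

```python
from collections import deque


class AllTestsPassMonitor:
    """Hypothetical counter of all-tests-pass results in a ten-minute window."""

    def __init__(self, threshold=3, window_s=600):
        self.threshold = threshold  # assumed value; the article gives none
        self.window_s = window_s    # ten-minute interval from the text
        self.events = deque()

    def record(self, now_s):
        """Record one all-tests-pass; return True if resynchronizing is still allowed."""
        self.events.append(now_s)
        # Drop results that have aged out of the window.
        while self.events and now_s - self.events[0] >= self.window_s:
            self.events.popleft()
        return len(self.events) <= self.threshold
```

When record returns False in this sketch, the caller would start
further tests instead of placing the control units back in synchronism.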

V. PERIPHERAL SYSTEM MAINTENANCE 

Both duplication and engineered redundancy are used for reliability
in the peripheral system (Refs. 1-3). Duplicated control circuitry
(controller) is used in the peripheral system wherever a trouble would
affect a significant portion of the equipment. Either controller may be
accessed from either control unit. Peripheral decoders, each of which
controls up to four trunk or service circuits, are not duplicated. Each
may also be accessed from either control unit. Network links, trunks,
junctors, and service circuits are provided in sufficient numbers that a
faulty unit can be avoided without significantly affecting service.

Troubles must be detected quickly and the faulty unit identified 
and removed from service. Several troubles may have to be tolerated 
at the same time, including those induced by the craftsman when 
attempting repair or making additions to the peripheral system. 
Maintenance programs provide a "best reasonable" mode of operation 
and craftsman interface. Extensive and generalized troubleshooting 
facilities are necessary because of the frequent equipment additions 
to the system and the ratio of wired-in circuitry to replaceable plug-in 
circuit boards. 

5.1 Trouble Detection and Recovery

Many of the troubles in the peripheral system are detected by
check circuitry during the normal execution of an order to the
periphery. This is especially true of troubles that have a significant
effect on the system, where rapid detection and recovery are most
important. Some troubles are detected by the call programs, which check
for expected results at strategic points in a call, or "become
suspicious" at unlikely occurrences (Ref. 2). They may initiate further
tests immediately or provide results to be accumulated for further
actions when an error threshold is reached.

The remaining troubles usually do not seriously affect service and 
are detected by manual and automatic routine exercise, audits, and 
trouble reports. 

5.1.1 Scanner Troubles 

The scanner organization, duplication, and interconnection are
described in Refs. 1 and 3. Each of the duplicated scanner controllers
may be accessed by either control unit. For some troubles in scanning,
a controller would be removed from service. For other troubles, a
control unit would have to be removed from service. Since interrogate
and readout windings are not duplicated, there are troubles which
affect a single row of 16 ferrods, or a column of up to 64 ferrods, for
any of the four controller-control unit combinations. This prevents use
of these ferrods.

Detection and recovery actions are indicated in Fig. 13 with the
separation between circuit and program functions. Troubles in the
selection of a scanner or row are detected by check circuits in the
input-output or scanner (Refs. 1 and 3). The scanner may be accessed by
a program order or by the autonomous input-output logic in the control
unit which scans for digits and line originations (Ref. 1).

A program scan order is immediately followed by a program check 
for an error indication and the trouble recovery program is called as 
a subroutine on error. For an autonomous digit scan error, the trouble 
recovery program is entered by an interrupt during which autonomous 
scan functions are stopped. The autonomous line scan stops when a 
new origination, or an error, or a last row is detected. The 25 milli- 
second interrupt program which controls this function detects such 
a trouble later on rescan by a program order. 

Scanner output troubles are not detected by check circuits. Some 
of these are detected by defensive design in programs which use the 
scanner. For example, supervisory scan programs suspect trouble for 
supervisory changes in successive rows, or more than one change in a 
row and call an immediate scanner output test as indicated in Fig. 13. 
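
The two "suspicious" conditions named above can be expressed as a
simple check on successive scan readings. This is a sketch of the idea
only: representing each row as a 16-bit integer and comparing two
complete scans at once are assumptions made for illustration.

```python
def suspicious(previous, current):
    """Return True if a scan result looks like scanner output trouble.

    previous and current are lists of 16-bit row readings from two
    successive supervisory scans (an assumed representation).
    """
    changed = [old ^ new for old, new in zip(previous, current)]
    for i, bits in enumerate(changed):
        if bin(bits).count("1") > 1:
            return True   # more than one change in a single row
        if bits and i + 1 < len(changed) and changed[i + 1]:
            return True   # changes in two successive rows
    return False
```

A True result in this sketch corresponds to calling the immediate
scanner output test rather than acting on the supervisory changes.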

Some output troubles cause a control unit mismatch on a subsequent 
call store read or write. An output test is performed for all scanners 
following a mismatch. Detection of other troubles relies on routine 
exercise with the diagnostic test. 

Individual ferrod troubles are not detected by any of these checks. 
These troubles show up as faulty operation of the circuits to which 
these ferrods are assigned. 

5.1.2 Scanner Trouble Recovery

Fig. 13 — Scanner trouble actions.

The trouble recovery programs, after verifying a trouble, first try
the other controller and then the other control unit in order to
identify the faulty unit. The order that failed or the test that detected
the trouble is used for these attempts. A bad interrogate or readout
winding is assumed and recorded if none of these tries is successful.
Once a controller or control unit is marked out of service, it will
not be restored automatically or used in a later trouble recovery
attempt unless the number of bad rows or readout windings becomes
excessive. When the number of errors recorded becomes excessive,
the out-of-service unit is restored automatically and the on-line unit
removed from service. This allows the corrective action for a second
trouble to override that for the first if the second trouble has a
significant effect on service.
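
The retry order described above can be sketched as follows. The
callable order_succeeds is hypothetical, and so is the choice to pair
the original controller with the other control unit on the second
retry; the article does not say which controller is used for that
attempt.

```python
def recover_scanner(order_succeeds, controller, control_unit):
    """Retry a failed scanner order on alternate equipment.

    order_succeeds(controller, control_unit) is a hypothetical callable
    that re-issues the failing order and reports whether it now works.
    Controllers and control units are numbered 0 and 1.
    """
    other_controller = 1 - controller
    other_control_unit = 1 - control_unit

    if order_succeeds(controller, control_unit):
        return "trouble not verified"
    if order_succeeds(other_controller, control_unit):
        return f"controller {controller} out of service"
    if order_succeeds(controller, other_control_unit):
        return f"control unit {control_unit} out of service"
    # No retry worked: assume a nonduplicated winding is at fault.
    return "bad interrogate or readout winding recorded"
```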

In resuming normal operations, a later diagnosis is requested for 
any new trouble found, and the call operations may be cleared or 
skipped if a row or column cannot be scanned. For example, a row 
trouble detected by supervisory scanning would cause that row to be 
skipped, whereas a row trouble detected by digit receiving would 
cause the digit receiver and path to be idled and a reorder tone con- 
nected. A test of the circuits in the call to which the ferrods are as- 
signed would be initiated. 

5.1.3 Network Connection Troubles 

Orders to the network require several milliseconds to complete. In
order to allow efficient use and control of the network access, orders
are sent by the program to all networks in an order execution cycle
of up to 10 milliseconds every 50 milliseconds (Ref. 2).

As indicated in Fig. 14, some check results are available immediately
after the network order and path data are sent to the network (Refs. 1-3).
Other check results are not available until the network order has had 
time to complete its operation. These latter check results, which are 
indicated by status ferrods, are checked for all network controllers 
just before beginning a network order execution cycle every 50 milli- 
seconds. Any errors that have occurred on the last order execution 
cycle are known at this time and cause the trouble recovery program 
to be started. 

5.1.4 Network Trouble Recovery 

Each network order attempt by the trouble recovery program will 
take at least 50 milliseconds and so must be interleaved with call 
processing. The trouble is verified by retrying the order. The periph- 
eral order buffer from which the order was sent must be determined so 
that the call may be stopped and the path and order data obtained. 
The other calls are not affected. The order is retried first with the
other controller, if in service, and then with the other control unit, if
synchronized, and the controller or control unit is removed from service
if found bad.

Since a sequence of network orders is sent before results are
checked, a control unit fault may result in several controller troubles
from that sequence of orders. In this case, the other control unit is
tried first to minimize delays. In other cases, controller trouble
indications are handled one at a time, and a second trouble occurring
while the first is being resolved stops all order execution until the
first trouble is resolved. The controller status ferrods may indicate
trouble such as loss of power or input noise when no order was sent. In
these cases, no peripheral order buffer that accessed the controller
will be found, and the entire sequence of network orders for that
execution cycle is repeated to verify the trouble and try other
combinations of controller and control unit.

Fig. 14 — Network trouble actions.

Trouble indications may also occur because of nonduplicated network
circuitry, in which case the order will not be successful with the
duplicate controller or control unit; or the duplicate controller or
control unit may have been left out of service by a previous trouble.
Path data are accumulated in these cases to determine the extent and
frequency of the trouble. Faulty links are removed from service through
correlation techniques. If the trouble is extensive in the frame and
affects many input terminals, the other controller or control unit is
automatically restored to service, as was the case with scanners. A
record of the number of peripheral units of any type with extensive
trouble with either control unit is maintained in this way so that the
best control unit can be favored.

Lack of continuity in the network is usually detected by call proc- 
essing. Path data for no continuity are correlated with that for other 
path troubles so that faulty links can be removed from service. 
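
One simple way to picture the correlation of path data is to count how
often each link appears in failing paths. The article does not describe
the actual correlation technique, so the counting rule, the min_hits
threshold, and the link labels below are all assumptions made for
illustration.

```python
from collections import Counter


def suspect_links(failed_paths, min_hits=3):
    """Return links that appear in at least min_hits failing paths.

    Each path is a list of link labels (hypothetical identifiers).
    """
    hits = Counter(link for path in failed_paths for link in set(path))
    return sorted(link for link, count in hits.items() if count >= min_hits)
```

For example, if three failing paths share only link "B4", that link is
the one this sketch would propose removing from service.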

After a successful network trouble recovery attempt, the operations 
for that call are continued. If the network trouble recovery is un- 
successful, the call is torn down and a reorder tone is connected where 
possible. 

5.1.5 Trunk and Service Circuit Access Troubles 

The central pulse distributor and its supplementary pulse distributor 
are used to change trunk and service circuit relay states. Selection and 
output troubles for these units are detected by check circuits when 
program orders are sent, and the trouble recovery actions are similar
to those described for scanners. A central pulse distributor trouble may
require use of the other control unit whereas a supplementary pulse
distributor trouble may require use of the other controller or, for some
troubles, the other control unit (Refs. 1-3). It may not be possible to find a
successful configuration for some output troubles. In these cases, the 
output number is recorded and the circuits that are assigned to that 
output cannot be used. Trunk and service circuit troubles may also 
show up as troubles in setting up a network connection or in a time 
out of some operation in the call where there are several possible 
sources of the error. When such circuits are possible sources of trouble 
in a call, they are put in a list for testing. The network links and those 
circuits which pass their tests or which cannot be tested are correlated 
on successive troubles as indicated in Fig. 14 to help isolate the source 
of the trouble. Various operational trunk and service circuit tests are
further described in Ref. 2.

5.2 Diagnostic Organization and Use 

5.2.1 Common Controllers 

The network, scanner, and supplementary pulse distributor controllers
are tested by a sequence of functional tests. For example, the
functional test blocks for the scanner controller are:

(i) Access test with proper addresses, observing check circuit results
for trouble.

(ii) Check circuit test using illegal addresses to see if the check
circuits are capable of indicating trouble.

(iii) Output test using a test signal that induces outputs on all 16
readout leads.

Each of these functions is relatively independent of the others and 
the trouble number is formed from results of the first functional test 
block which fails. Several test results within a functional test block 
are usually combined in the test number to improve the resolution and 
provide error data to help the craftsman when the Trouble Locating 
Manual listing is insufficient. For example, in the scanner controller, 
the one-out-of-64 ferrod rows is selected by an 8-by-8 coincident core 
matrix. 3 The two one-out-of-eight selections are received over 16 leads 
of the peripheral unit address bus. In the access test block, all 64 rows 
of the ferrod matrix are accessed in sequence. The OR function of the
proper peripheral unit address bus contents for all failing addresses 
provides a 16-bit error data field (6 octal characters) which is part 
of the trouble number in the diagnostic message shown here: 

35 MI MS DGN 02 1 1264 004377

where the fields are:

35             Message occurred 35 minutes after the hour
MI             Maintenance Information
MS DGN 02 1    Diagnostic for Master Scanner 02, controller 1
1264 004377    Trouble number
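
The formation of the error data field can be illustrated with a short sketch. The assignment of the two one-out-of-eight selections to the upper and lower eight bus leads, and the function names, are assumptions made only for illustration; the article does not give the actual lead assignment.

```python
def bus_pattern(row: int) -> int:
    """Return an assumed 16-lead address pattern for one of the 64 ferrod rows."""
    if not 0 <= row < 64:
        raise ValueError("row must be in 0..63")
    a, b = divmod(row, 8)              # the two one-out-of-eight selections
    return (1 << (8 + a)) | (1 << b)   # assumed: 'a' on upper leads, 'b' on lower


def error_data_field(failing_rows) -> str:
    """OR the patterns of all failing rows and format as 6 octal characters."""
    word = 0
    for row in failing_rows:
        word |= bus_pattern(row)
    return format(word, "06o")


# Hypothetical case: all eight rows that share one selection lead fail.
print(error_data_field(range(24, 32)))   # prints 004377
```

Under this assumed lead assignment, a field of 004377 would mean that every failing address used the same lead in one selection group and all eight leads in the other, which points toward a single fault common to that one lead.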



In a particular test block, failure modes, such as bus receiver or core 
driver faults, are easily analyzed by class when designing the program 
for determining the associated trouble numbers. These failure modes 
can then be ignored in the program design and analytical generation 
of Trouble Locating Manual entries for any functional test block that 
follows in the sequence. 

Peripheral system access troubles may be caused by faults in the 
control unit access that are not detected by the control unit diagnostic 
program. The trouble recovery programs would normally leave the 
faulty control unit off line. The same functional tests that are used
for testing a controller are also used for testing the access to a
controller from the off-line control unit. Additional tests are needed in
some cases for testing input-output control circuitry that is not 
tested by the control unit diagnostic program. 

When testing the control unit access, the orders are executed from 
the off-line control unit, and the first character of the trouble number is 
modified so that different equipment may be indicated by the Trouble 
Locating Manual listing. The listing for these trouble numbers is only 
accurate if the trouble is in the control unit access and not in the 
controller, so the request for diagnosis from the off-line control unit 
is rejected if there is a controller trouble from the on-line control 
unit. In addition, the off-line control unit is automatically diagnosed 
and must pass before the access is diagnosed. Diagnosis of a controller 
or controller access is also rejected unless the duplicate controller is 
in service for call processing use. For simplicity, only one diagnosis 
may be in progress at any one time and it will be aborted if a new 
trouble occurs while it is in progress. 
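
The screening rules in the paragraph above can be summarized in a small sketch. The class, field names, order of the checks, and the choice to reject (rather than hold) a request made while another diagnosis is running are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class AccessDiagnosisRequest:
    controller_fails_from_online_cu: bool   # trouble is in the controller itself
    offline_cu_passed_diagnosis: bool       # off-line control unit diagnosed first
    duplicate_controller_in_service: bool   # mate available for call processing
    diagnosis_in_progress: bool             # another diagnosis already running


def screen_request(req: AccessDiagnosisRequest) -> str:
    """Return "accept" or a rejection reason, checking each rule in turn."""
    if req.diagnosis_in_progress:
        return "reject: another diagnosis is in progress"
    if not req.duplicate_controller_in_service:
        return "reject: duplicate controller is not in service"
    if req.controller_fails_from_online_cu:
        return "reject: controller trouble seen from on-line control unit"
    if not req.offline_cu_passed_diagnosis:
        return "reject: off-line control unit failed its own diagnosis"
    return "accept"


print(screen_request(AccessDiagnosisRequest(False, True, True, False)))  # accept
```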

A separate Trouble Locating Manual is provided for each type of 
peripheral unit. 

5.2.2 Use of Diagnostics

The diagnostic programs are also used for restoring equipment to 
service after repair, for giving the peripheral system automatic routine 
exercise, and for testing new equipment additions. 

Controllers and their access are tested about once a day in a low
traffic period. The exception is the scanner output circuitry, where
service-affecting troubles are not detected by check circuits while in
use; it is tested more frequently (about every minute or faster,
traffic permitting).

Remote execute facilities, with pushbuttons and indicator lamps 
located on various frames throughout the office, allow tests previously 
specified by teletypewriter to be stepped through or repeated by op- 
erating a pushbutton. Pass-fail results are indicated on lamps which 
are part of the remote execute facility. 

5.3 Peripheral System Growth 

New equipment added in an operating system for growth and for a 
new installation must be tested thoroughly before it is put into serv- 
ice. In an operating system, the installing and testing should have a 
minimal effect on service. Troubles that occur during shipment,
installation errors, and any troubles that did not show up in factory
testing must be corrected at this time. The use of connectors for interframe
wiring reduces the installation interval. It also reduces the wiring 
errors since the connector wiring can be tested at the factory. 

The diagnostic test programs described are used where possible to 
help locate these troubles. The Trouble Locating Manual is not satis- 
factory for this purpose because several troubles and wiring errors 
may be present. Entries in the Trouble Locating Manual were formed by 
predicting test results for a single failure that could occur in an opera- 
tional system. Specifying test results for combinations of trouble is 
impractical and the craftsman must trace the symptoms manually. 
The diagnostic program provides a means to exercise all functions of 
the equipment and identify the trouble symptoms. The diagnostic 
program may print out a failing order or functional operation, for 
example, instead of a trouble number. The order or functional opera- 
tion may be requested, with the repeat option if desired, and the op- 
eration observed by oscilloscope or other test instruments. This 
method of locating faults is also useful for troubles that occur in 
operation in the few cases where the Trouble Locating Manual is in- 
sufficient. 

Additional test programs and manual test procedures are needed 
for some of the possible troubles at installation. Network link troubles, 
which are identified by error correlation techniques in normal use, 
require a program test for any new network frames added. Transla- 
tion data in the memory, which relates to equipment in the office, 
changes when equipment is added, and must also be verified. 

Such units as the scanner and supplementary central pulse distrib- 
utor tie into the common peripheral unit address bus and scan answer 
bus. The off-line control unit is not usable until the bus leads are 
connected and verified for correct wiring, polarity, and waveform in 
both directions from the new frame added. Interframe bus connectors 
help to minimize the exposure of the system to other troubles during 
this operation. If other troubles do occur, the system will stabilize in 
a "best" operating mode, and the craftsman may reinitiate this deci- 
sion process, if necessary, after restoring the bus integrity. 

Half of a scanner ferrod matrix (512 ferrods) may be added during 
growth. In this case, the output leads must be disconnected and the 
additional ferrods added in series with minimal effect on normal use 
of the other ferrods. Here again connectors are used to allow rapid 
change over and recovery in case of trouble in matrix addition. The 
repetitive test mode allows rapid indication of a trouble on pass-fail
lamps and provides a diagnostic printout so that the craftsman can
minimize the time that a faulty matrix addition is connected. 

The network may be increased in size by the addition of a network
control and junctor switching frame or by adding line or trunk
switch frames to an existing network control and junctor switching
frame. 3 The network control and junctor switching frame contains
new network and scanner controllers. Both the access to these from
the off-line control unit and the controllers themselves are tested using
the diagnostic tests. Controller access to the line-trunk switching
frame is also tested with the diagnostic tests. In both cases, the
network links are tested with a special network fabric test program.
While testing the network control and junctor switching frame, the 
junctors are connected back into the same network control and junc- 
tor switching frame at the junctor grouping frame in a standard test 
pattern. A junctor reassignment over all network control and junctor 
switching frames must then be performed before the new network 
control and junctor switching frame can be put into service. The junc- 
tor translation tables are updated in memory to indicate both the 
existing assignments and the new assignments. The junctors are seg- 
mented into four parts at the junctor grouping frame and only one 
of the segments is removed from service at a time for junctor modifica- 
tion. A verification of the new junctor connections with the transla- 
tion tables and a test of circuit junctors is made for each segment 
before those junctors are put back into service. After the reassign- 
ment is complete, any new trunks, service circuits, and lines to be 
added are tested and put into service. The common control frames 
such as the scanner or supplementary pulse distributor are, of course, 
tested before the circuits they monitor or control are tested. 

VI. MAINTENANCE MONITOR 

The base level maintenance monitor is the primary noncall process- 
ing executive program in the office and all maintenance programs come 
together through it. The responsibilities of the monitor include: 

(i) Recognizing changes in the system's state and initiating a
proper response to a state change (state detector).

(ii) Controlling almost all base level maintenance programs in the
office (scheduler).

(iii) Monitoring all maintenance input messages from the craftsman.

(iv) Initiating periodic work that must be done on a time schedule.

(v) Miscellaneous functions such as timing and controlling system
alarm conditions and checking the integrity of the entire base level
scan.

Figure 15 shows the "information flow" through the main parts
of the monitor.

6.1 System State Detector

State changes in the system are either caused by trouble detection 
mechanisms or are induced by the craftsman from the maintenance 
center or via the teletypewriter. 

It is the state detector's responsibility to look over the last scan
and detect and take action on the occurrence of: (i) a mismatch
interrupt, (ii) an input-output trouble recovery action, (iii) a system
initialization or restart, (iv) the failure of an automatic error
detecting circuit or a periodic control unit detection test in either
control unit, and (v) manually initiated changes such as putting the
system into the manual mode of operation.

When the state detector sees that a change has taken place, it
initiates an output message identifying the change, records the
occurrence of a control unit switch, and lets the craftsman know what
automatic action will be taken.

In addition, the state detector feeds state information to both the 
maintenance program scheduler and the teletypewriter maintenance 
input message monitor which enable these programs to control items 
already in progress in the system or coming into the system via the 
teletypewriter. For instance, when a craftsman puts the system into 
the manual mode, the state detector feeds the necessary information 
to the maintenance program scheduler to inform it that he has taken 
control over the off-line control unit, and no automatically initiated 
maintenance program from then on should interfere with him. 

The state detector also guarantees a consistency of hardware con- 
trol in the system. As an example, when the craftsman goes to the 
manual mode, the state detector blocks all interrupt signals from 
the maintenance center. If the craftsman wishes to generate a manual 
interrupt, he must then type in the necessary input request message 
which will store what the craftsman wishes to do when the interrupt 
occurs and release the interrupt "block." 

6.2 Maintenance Program Scheduler 

The heart of the maintenance monitor is the maintenance program 
scheduler. The scheduling algorithm is fairly simple and is tailored
to the type of programs it controls. These programs involve interface
with the craftsman and are all multiscan (that is, they are allotted
only a certain amount of real time per scan and take many scans to
complete). In general, the scheduler will allow only one such program
to be in progress in the system at any time. This avoids both confusion
by the craftsman and interference problems which can result from
running these programs concurrently. For instance, the control unit
diagnostics program assumes it has absolute control over the off-line
control unit during the entire time it is in progress and that no other
program can change off-line register or memory contents. Off-line
peripheral unit diagnostics assume the off-line control unit is healthy
and could produce misleading information were they to be initiated
while control unit diagnostics are in progress.

The basic "one-at-a-time" algorithm of the scheduler should also 
help desensitize the system to potential troubles which show up in 
different areas depending on the time of occurrence. 

The scheduler operates on a four-word memory block representing a 
matrix of four rows and 16 columns. Each maintenance program is 
assigned one or two column positions and has associated with it a 
"request," an "in progress," an "allow," and an "abort" bit. Figure 
16 is a picture of the matrix showing the matrix positions of the pro- 
grams under control of the scheduler. Any initiator of one of these 
programs (teletypewriter or automatic) simply sets the proper request 
bit in the matrix. The scheduler will then analyze this request with 
regard to whether the program is "allowable" in the present system 
state (the allow word is initialized every scan by the system state 
decoder) and whether a higher priority maintenance activity is now in 
progress in the system. The priority of the various programs is repre- 
sented by their column positions in the matrix. This a priori priority 
structure can be determined by considering both the programs them- 
selves and the request source. The various maintenance programs in 
the system operate in a realm of concentric circles in their importance 
to the system and in their assumptions (that is, off-line peripheral unit 
diagnostics assume that the off-line control unit is healthy). Auto- 
matic requests are triggered by changes in the system state and by 
automatic trouble detection signals; consequently they are more 
urgent than teletypewriter requests. 

Fig. 16. Maintenance program scheduler matrix.

The scheduler will take the abort entry associated with each program
if: (i) a higher priority request enters the system while it is in
progress, or (ii) a change in the system state occurs and the state
detector marks it nonallowable.
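
A minimal sketch of the four-word matrix and the one-at-a-time rule follows. It assumes that a lower column number means higher priority, that each program occupies a single column, and that a pending request is started on the scan after an abort; the class and method names are hypothetical and none of these details are given in the article.

```python
NUM_COLUMNS = 16
MASK = (1 << NUM_COLUMNS) - 1


def lowest_set_bit(word: int):
    """Index of the lowest set bit, or None if the word is zero."""
    return (word & -word).bit_length() - 1 if word else None


class Scheduler:
    """Four 16-bit words: request, in progress, allow, abort (one bit per column)."""

    def __init__(self) -> None:
        self.request = 0
        self.in_progress = 0
        self.allow = MASK
        self.abort = 0

    def set_request(self, column: int) -> None:
        self.request |= 1 << column

    def scan(self, allow_word: int) -> None:
        """One base-level scan; allow_word is supplied by the state detector."""
        self.allow = allow_word & MASK
        self.abort = 0
        best = lowest_set_bit(self.request & self.allow)
        running = lowest_set_bit(self.in_progress)
        if running is not None:
            not_allowed = ((self.allow >> running) & 1) == 0
            preempted = best is not None and best < running
            if not_allowed or preempted:
                self.abort = 1 << running    # take the program's abort entry
                self.in_progress = 0
            return                           # pending request is held this scan
        if best is not None:
            self.request &= ~(1 << best)
            self.in_progress = 1 << best


s = Scheduler()
s.set_request(5)
s.scan(MASK)            # column 5 starts
s.set_request(2)
s.scan(MASK)            # higher priority request: column 5 is aborted
s.scan(MASK)            # column 2 starts on the following scan
print(s.in_progress == 1 << 2)   # prints True
```

Because the allow word is still available when the abort entry is taken, an abort routine in this sketch could tell the two abort reasons apart by testing its own allow bit.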

Abort routines are necessary to prevent erroneous decisions by 
maintenance programs. These decisions can result in the improper 
removal of equipment from service and in misleading information on 
the teletypewriter. They also serve to let the craftsman know what 
is happening in the system. Further, in the case of a higher priority 
request, the time cannot be taken to let the lower priority activity 
finish but the request can be held until the abort is complete. The 
abort programs, if necessary, can distinguish between the two reasons 
for entering the abort by looking at the allow bit. For instance, the 
abort for the magnetization program (single card writer translation 
update) will not force the craftsman to start from the beginning if it 
finds a higher priority request entered. 

Various minor options are available with the scheduler. By use 
of logic masks, subsets of maintenance programs can be allowed to 
run concurrently. The craftsman, via a teletypewriter overwrite, can 
control the state detector and either allow or not allow a program to 
run until he tells the system otherwise. 

The scheduler also takes care of the control of "repetitive" and
"step" functions (remote execute facility). The craftsman can specify
the repetitive or step option on an input request for a maintenance
program. The scheduler will then either repeat the requested program
continuously or repeat it on signal, and provide output on the
teletypewriter or in lights depending on which option was chosen. The
scheduler will allow only one repetitive or step function to be in
progress at any one time. The repeat or step control is very useful in
the repair procedure.

The maintenance programs not under control of the scheduler in- 
clude the service circuit and trunk tests, and tests associated with 
the ringing and tone plant and the automatic message accounting unit. 
Many of these tests are progress mark routines operating out of 
transient call records. The tests do not require the complex control of 
the programs under the scheduler and do not interfere with other 
maintenance activity in the office. As far as the maintenance monitor 
is concerned, the tests behave like a typical call in progress. The 
monitor, however, still is responsible for initiating these tests pe- 
riodically and the tests must be aware that a major maintenance ac- 
tion has occurred. 

6.3 Maintenance Input Message Monitor

The maintenance input message monitor routes the various mes- 
sages from the teletypewriter to the proper subprograms and per- 
forms validity checks common to all inputs. If the craftsman specified 
a priority on the input or used the repetitive or step option, these 
items are checked for validity. In addition, the present system state 
computed by the state detector can be compared with the states al- 
lowable for a given input and the message rejected if the state is not 
correct. 
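
The common validity checks might be sketched as follows. The request names, the state names, and the table of allowable states are hypothetical; only the "repetitive" and "step" options and the idea of comparing the present state with the allowable states come from the text.

```python
VALID_OPTIONS = {None, "repetitive", "step"}

# Hypothetical table: system states in which each request is allowable.
ALLOWABLE_STATES = {
    "switch control units": {"normal"},
    "control unit diagnostic": {"normal", "manual"},
}


def screen_input(request: str, system_state: str, option=None) -> str:
    """Apply the checks common to all maintenance input messages."""
    if request not in ALLOWABLE_STATES:
        return "reject: unknown request"
    if option not in VALID_OPTIONS:
        return "reject: invalid option"
    if system_state not in ALLOWABLE_STATES[request]:
        return "reject: not allowable in present system state"
    return "accept"


print(screen_input("control unit diagnostic", "manual", "repetitive"))  # accept
```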

Input messages to change the system state (switch control units, 
remove off-line equipment from service, restore off-line equipment to 
service, and so on) or to provide system status information, are proc- 
essed wholly within the monitor. Messages that request programs 
under the scheduler's control merely set the proper request bit in the 
matrix. 

6.4 Time Monitor 

The time monitor initiates all periodic activity that operates on a
time schedule. It must deal with intervals ranging from seconds up to
a full day.

Status information at the maintenance center, such as the stand-by
lights, is updated once per second and small interval timing is
provided for traffic measurements.

The time monitor serves to initiate periodic exercises on the control
unit and the periphery during low traffic hours. These programs
seek to find failures by exercising all equipment, thus avoiding the
unnecessary failing of calls. The control unit exercise checks all the
special maintenance circuitry in the processors, performs store margin
tests, and checks the ability of the control units to switch.

6.5 Miscellaneous Functions 

The alarm monitor keeps track of various alarm conditions for the 
office, such as fuse failures, and times local alarms if the alarms are 
transferred to a remote location. The fuse ferrods are checked at in- 
tervals from the time monitor. 

The maintenance monitor is responsible for resetting the program 
timers and checking the integrity of the entire base level scan. Soft- 
ware checks are made to detect program skipping and the program 
timers protect the system against program looping. 1 

During every scan the monitor calls in the teletypewriter base level 
processing program and programs which deal with trouble correlation 
and measurements. Counts are kept of almost all troubles in the sys- 
tem ranging from customer receiver troubles to failure of control unit 
diagnostics. These service and performance measurements ("plant") 
should give the craftsman a good picture of the total "health" of the 
system. 

VII. DATA MAINTENANCE 

One of the primary maintenance objectives for any electronic
switching system is to insure the best possible integrity of call store
information in the system during periods of trouble. Call store
mutilation can result from hardware faults and intermittents, and from
program bugs. During periods of nonsynchronous operation, the mismatch
detection mechanism is lost to the system and the detection of some
hardware troubles (those that would not be caught by the other
automatic error detecting circuits) will be delayed until a specific
periodic test finds them. In this situation, data mutilation can occur
between the time the trouble occurs and the time it is detected.

System response to transient and intermittent failures depends on
programs recognizing a high rate of error signal occurrence over a
period of time. Very infrequently accessed branches of the program can
have bugs which will mutilate some memory despite the best effort in
program debugging prior to cutover. It is interesting that the mismatch
detection mechanism and synchronous operation are not much help for
program bugs since for most bugs the processors will maintain
synchronization. In this sense, a program bug is equivalent to two
simultaneous and identical hardware failures.

7.1 Preventive Techniques 

The potential effect of any of these problems on memory is highly 
dependent on the basic program algorithms of the system. The ease
of communication among programs, the absence of linked list
structures, and the per call assignment of major blocks of memory
significantly aid the task of data maintenance. For instance, the progress
mark approach to call processing taken by the No. 2 ESS assigns an 
arbitrary block of call store (transient call record) to each call which 
remains fixed while it is being processed ("transient"). Additional 
storage associated with the call is added as necessary (peripheral 
order buffers and originating registers). Each progress mark routine 
is entered with the transient call record or peripheral order buffer 
block address as data, and the routine works within the block with 
relative addressing. The scope of a progress mark routine is then a 
small set of data relating to a single call. 

In addition, defensive programming techniques are used throughout 
the No. 2 ESS. Table indexes obtained from call store are range- 
checked or small tables are made complete to cover all possible index 
values with invalid indexes pointing to error routines. Programs that 
transfer, based on the call store, check the address for all zeroes first 
in case the recovery programs had zeroed this word so that they will 
not continue to transfer wildly. Translation information is obtained 
by accessing a master table index which will insure that, with bad 
data in call store, program instructions will not erroneously be read 
as data with a resultant parity error failure. The intent of such de- 
fensive programming techniques is to allow processing to continue in 
the face of bad data and limit the effect of the data to one or a few 
calls. 
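
Three of these defensive techniques can be illustrated with a short sketch: a small table made complete, a range-checked index, and a check for an all-zero transfer word. The record layout, routine names, and table contents are hypothetical.

```python
def error_routine(record: dict) -> str:
    """Confine the damage to this one call."""
    record["progress_mark"] = 0
    return "error"


def idle(record: dict) -> str:
    return "idle"


def dialing(record: dict) -> str:
    return "dialing"


# A small table made complete: every value of a 2-bit index has an entry,
# with the invalid values pointing to the error routine.
ROUTINES = [idle, dialing, error_routine, error_routine]


def dispatch(record: dict) -> str:
    index = record["progress_mark"] & 0b11   # cannot fall outside the table
    return ROUTINES[index](record)


def range_checked(table: list, index: int):
    """Range-check an index read from call store before using it."""
    if 0 <= index < len(table):
        return table[index]
    return error_routine


def transfer(record: dict, table: list) -> str:
    """Transfer through a word held in call store, checking for zero first."""
    word = record["transfer_word"]
    if word == 0:                            # zeroed by a recovery program
        return "no transfer"
    return range_checked(table, word)(record)


print(dispatch({"progress_mark": 3}))                                  # error
print(transfer({"transfer_word": 99, "progress_mark": 1}, ROUTINES))   # error
```

In each case bad data leads to an error routine for the one call involved rather than to a wild transfer.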

7.2 Corrective Techniques 

Despite the algorithms employed and the defensive techniques
used, programs are still required which will detect and recover from
bad data. In the No. 2 ESS, these involve (i) audit programs, and
(ii) system recovery or initialization programs.

7.2.1 Audits 

The No. 2 ESS call store memory contains many items of redundant 
information in different forms, some associated with individual calls 
and other information primarily equipment oriented. The memory 
also contains links connecting blocks associated with a call. It is the 
function of audit programs to ascertain whether these various items 
in memory are consistent. 

Separate audit programs are written for the various memory blocks 
such as the transient call records, terminal memory records, line status 
bits, originating registers, peripheral order buffers, and the network 
map. 2 For example, the originating register audit program checks for 
a correct linkage from the originating register to a terminal memory 
record and transient call record. When an audit program finds in- 
consistencies, it attempts to idle the memory blocks and, if possible, 
the corresponding equipment. The audit programs are called in pe- 
riodically from the time monitor, can be initiated from the teletype- 
writer, and form an important part of the system recovery strategy. 
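
As an illustration of the originating register audit, the sketch below checks that each register and the two records it names point back at each other, and idles the blocks when they do not. The dictionary layout, the field names, and the modeling of "idling" as deletion are assumptions.

```python
def audit_originating_registers(orig_regs: dict, tmrs: dict, tcrs: dict) -> list:
    """Idle every originating register whose links are inconsistent.

    Each originating register names a terminal memory record ("tmr") and a
    transient call record ("tcr"); each of those is expected to point back.
    """
    idled = []
    for reg_id, reg in list(orig_regs.items()):
        tmr = tmrs.get(reg.get("tmr"))
        tcr = tcrs.get(reg.get("tcr"))
        consistent = (
            tmr is not None and tmr.get("orig_reg") == reg_id
            and tcr is not None and tcr.get("orig_reg") == reg_id
        )
        if consistent:
            continue
        # Idle the register and any block that still points at it.
        del orig_regs[reg_id]
        for blocks in (tmrs, tcrs):
            for key in [k for k, b in blocks.items() if b.get("orig_reg") == reg_id]:
                del blocks[key]
        idled.append(reg_id)
    return idled


orig_regs = {1: {"tmr": 10, "tcr": 20}, 2: {"tmr": 11, "tcr": 21}}
tmrs = {10: {"orig_reg": 1}, 11: {"orig_reg": 2}}
tcrs = {20: {"orig_reg": 1}, 21: {"orig_reg": 7}}   # record 21 has a bad link
print(audit_originating_registers(orig_regs, tmrs, tcrs))   # prints [2]
```

In this sketch transient call record 21 is left in place, since it does not point at the idled register; a separate audit of the transient call records would be expected to deal with it.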

7.2.2 System Recovery 

The system recovery program (or "initialization" program) is 
triggered by hardware (program timers 1 ) upon the occurrence of a 
control unit switch when the system was not running in synchroniza- 
tion. Multiple control unit switching while in the synchronous mode 
will also activate it. Control unit switching can be caused by bad 
data and software bugs as well as hardware failures. For example, the 
program timers will switch control units if the program hangs up in 
a loop. The control unit switch itself is the proper response to a hard- 
ware failure and the recovery program's job is to: 

(i) Insure the new on-line call store is reasonably consistent with
the state of the periphery regardless of the cause of the control unit
switch. When the off-line control unit is inhibited and is not under
the control of a craftsman, the base level maintenance monitor
activates circuitry which causes the off-line call store to automatically
track the on-line call store. If a control unit switch occurs, the early
"phases" of recovery can assume the new on-line call store is
consistent. If, however, a switch occurs while the craftsman is in control,
the entire new off-line call store must be moved across to the new
on-line call store.

(ii) Clear memory in increasingly larger segments in an attempt 
to stabilize the system if software is at fault. 

A typical sequence of recovery attempts would involve: 

(i) Isolating the program in control of the system at the time of the 
control unit switch, and taking appropriate action on memory blocks 
associated with the program (either clearing the blocks or marking 
them bad for later action). 

(ii) Calling in the audit programs in an emergency mode. 

(iii) Clearing all transient data in the call store while preserving
the stable call records.
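
The escalating sequence can be pictured as a loop over recovery phases that stops as soon as the system is stable. The memory model, the phase actions, and the stability test below are hypothetical stand-ins.

```python
def run_recovery(phases, is_stable):
    """Apply recovery phases in order until the system reports stable."""
    for name, action in phases:
        action()
        if is_stable():
            return name
    return None     # nothing further is attempted automatically


call_store = {"suspect": "bad", "transient": ["t1", "t2"], "stable": ["s1"]}


def clear_suspect_blocks():
    call_store["suspect"] = None


def emergency_audits():
    call_store["transient"].remove("t2")   # audits idle what they find inconsistent


def clear_transient_data():
    call_store["transient"].clear()        # stable call records are preserved


PHASES = [
    ("clear blocks of the program in control", clear_suspect_blocks),
    ("emergency audits", emergency_audits),
    ("clear all transient data", clear_transient_data),
]

# Hypothetical fault: the system stays unstable while record "t1" exists.
result = run_recovery(PHASES, lambda: "t1" not in call_store["transient"])
print(result, call_store["stable"])   # prints: clear all transient data ['s1']
```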

The craftsman can force in the recovery program and he alone can 
trigger the initialization of the stable data. Any recovery attempt will 
notify the base level maintenance monitor to abort any maintenance 
program in progress in the system. 

VIII. EXAMPLE OF SYSTEM TROUBLE 

We will now go through a trouble example where a fault occurs in 
the input-output unit of the control unit affecting the access to a 
peripheral unit. We describe the sequence of actions for detection, 
recovery, discontinuing the troubleshooting that was in progress, 
diagnosis, and repair. 

Assume that network controller 1 for line trunk network 2 is out of 
service and the craftsman is troubleshooting this controller. He has 
typed in a request to repetitively execute an order to the controller. 
Assume, at this time, that the system is in synchronism with control 
unit on-line and a central pulse distributor enable translator gate 
fails (see Fig. 17). This failure is first detected by a program scan 
order to master scanner 3, controller 0. 

8.1 Trouble Detection and Recovery 

The failing scan order returns an indication which results in a pro- 
gram transfer to a maintenance recovery program. This program 
verifies the order failure and retries with duplicate controller 1 which, 
for the trouble specified, will also fail. The recovery program then 
switches control units and again retries the order which now will be 
successful. 
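
The retry sequence just described can be sketched as follows. The function signature and result labels are hypothetical, and the choice to retry the original controller after the control unit switch is an assumption; the text says only that the order is retried.

```python
def recover(execute, online_cu: int, controller: int):
    """Retry a failing peripheral order.

    execute(control_unit, controller) -> True if the order succeeds.
    Returns (finding, on-line control unit, controller in use).
    """
    if execute(online_cu, controller):
        return ("failure not verified", online_cu, controller)
    mate = 1 - controller
    if execute(online_cu, mate):
        return ("controller trouble", online_cu, mate)
    other_cu = 1 - online_cu               # switch control units
    if execute(other_cu, controller):
        return ("control unit access trouble", other_cu, controller)
    return ("no successful configuration", other_cu, controller)


# Fault in the access circuitry of control unit 0: every order from it fails.
print(recover(lambda cu, ctl: cu != 0, online_cu=0, controller=0))
# prints ('control unit access trouble', 1, 0)
```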

The new off-line control unit is marked out of service with bad
access to master scanner 3 recorded. The following message is printed:

36 MA CU RMV MS 03

where the fields are:

36              Time in minutes since the last hour
MA              Maintenance Action
CU RMV MS 03    Control Unit Removed from service for faulty access to
                Master Scanner 03

The message is marked as a major alarm.

Both controllers of the scanner remain in service from the on-line 
control unit 1. A request is made for diagnosis of the stand-by control 
unit and its access to master scanner 3 controller. In addition, the 
base level maintenance monitor is notified to abort any conflicting 
activities. The trouble recovery program then returns control to the 
call program. Only a few milliseconds have elapsed since the trouble 
was detected and the scan order was accomplished, so service is not 
affected. 

When the maintenance monitor gets control at the next end of
scan, it will cause the repetitive order operation that the craftsman
initiated to be aborted. The craftsman will be informed of this action
by the pass-fail lamps both being dark, and by the following
teletypewriter message:

36 MR NW ORD 02 01 ABT

where the fields are:

36           Minutes since the last hour
MR           Maintenance Response related to a previous manual request
NW ORD 02    Network Order function for line or trunk switching
             network 02
01 ABT       Controller 0, from off-line control unit 1, has been Aborted

Fig. 17. A control unit fault. (Block diagram labels: enable address
register; Z translator; "at least one Z" detector; "more than one Z"
detector; inputs from the X translator and from the Y translator;
central pulse distributor matrix with bipolar outputs to the peripheral
system. Located in the input-output unit, equipment location tray 30,
card position 46.)

8.2 Diagnostics 

Once the abort is completed, the monitor will recognize the diag- 
nostic request made by the recovery program and diagnostics will be 
initiated interleaved with call processing. In this case, control unit 
diagnostics will fail in an input-output circuitry test block. 

As shown in Fig. 9, input-output circuitry is tested by test blocks 
located near the end of the diagnostic sequence. This means input- 
output circuitry can be tested by programs run entirely in the off- 
line control unit since almost all off-line program control circuitry has 
been previously tested. 

The on-line control unit uses external commands to first set up 
an off-line program starting address and then to start the off-line 
program control. At some later time (after sufficient time to allow test 
completion) , the on-line control unit looks at the state of the off-line 
unit to determine if the off-line unit passed or failed its diagnostic test. 
If it failed, the state of off-line registers is used to form the diagnostic 
printout. 

Diagnostic test block 38 tests central pulse distributor circuitry in
this manner. Internal maintenance commands are executed in the
off-line program control with the input-output stopped in order to test
the Z central pulse distributor translator. The enable address register is
first set to an address which should select a particular Z translator
gate. The translator is then enabled using a maintenance instruction
which generates clock pulses for one clock interval. If the Z translator
does not fire properly, one of the Z check circuits will produce an
error signal: the "not more than one Z detector" or the "at least one
Z detector." The "at least one Z detector" will produce an error when
the EAZ0 gate is selected because no output can be produced. In test
segment 12 of block 38, the off-line program control exercises all 16 Z
translator gates and accumulates a data word which has a 1
corresponding to each translator gate which resulted in an "at least one
Z detector" or a "not more than one Z detector" error signal. For the
EAZ0 gate output ground failure, the data word is 0000000000000001 in
binary or 000001 in octal code.
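
The accumulation of the data word can be sketched in a few lines. The mapping of gate number to bit position (gate 0 in the least significant bit) is an assumption chosen to agree with the data word quoted above, and the function name is hypothetical.

```python
def z_translator_test(gate_has_error) -> int:
    """Exercise all 16 Z translator gates and accumulate one bit per failing gate.

    gate_has_error(gate) -> True if either Z check circuit signals an error
    when that gate is selected.
    """
    word = 0
    for gate in range(16):
        if gate_has_error(gate):
            word |= 1 << gate
    return word


# Single failing gate, assumed to map to bit 0.
word = z_translator_test(lambda gate: gate == 0)
print(format(word, "016b"), format(word, "06o"))
# prints: 0000000000000001 000001
```

A nonzero word is what causes the test to be reported as failed; a word of zero means that every gate fired exactly one output.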

The off-line control unit stops itself when the Z translator data 
word is found to be nonzero. The on-line control unit finds that the 
off-line control did not successfully complete the central pulse dis- 
tributor test and uses the off-line state to form a diagnostic printout. 
The block number (38) and the segment number (12), plus the data 
word 000001 in octal form, are used to form the diagnostic printout: 

    35   MA CU DGN   38   12   000001

Reading the fields from right to left:

    000001       Test data in octal
    12           Test segment number
    38           Test block number
    MA CU DGN    Maintenance Action Diagnostic printout for Control Unit
    35           Time in minutes since the last hour

The printout is also marked as a major alarm.
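
The way the block number, segment number, and data word combine into the printout can be sketched as below. This is a hedged illustration only. The function name, the single-space separators, and the zero-padded two-digit fields are assumptions. The major alarm marking is omitted because its printed form is not given here.

```python
def format_cu_diagnostic(minutes, block, segment, data_word):
    """Build the text of a Control Unit diagnostic printout line.

    Assumptions (not taken from the system itself): fields are separated
    by single spaces, and minutes, block, and segment are printed as two
    zero-padded decimal digits. The data word is printed as 6 octal digits.
    """
    if not 0 <= minutes <= 59:
        raise ValueError("minutes must be in 0..59")
    if not 0 <= data_word <= 0xFFFF:
        raise ValueError("data word must fit in 16 bits")
    return f"{minutes:02d} MA CU DGN {block:02d} {segment:02d} {data_word:06o}"
```

For example, `format_cu_diagnostic(7, 38, 12, 1)` yields `07 MA CU DGN 38 12 000001` for a failure reported seven minutes past the hour.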

The craftsman looks up this diagnostic printout in the translation
section of the Trouble Locating Manual to obtain a list of replacement
circuit packs. Figure 2 shows the Trouble Locating Manual entry
corresponding to the diagnostic printout obtained for the Z translator
failure. Notice that the entry for 3812 (test block 38, segment 12)
contains a short explanation of the failure area. The circuit packs
are identified by the circuit pack location (10-030-46) and the circuit
pack type (A403).
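
The manual lookup amounts to a table keyed by block and segment number. The sketch below models it in Python under stated assumptions. The table layout, the function names, and the explanation wording are hypothetical. Only the key 3812 and the pack at location 10-030-46 of type A403 come from the text; any further packs in the real entry are not reproduced.

```python
# Hypothetical in-memory stand-in for the translation section of the manual.
TROUBLE_LOCATING_MANUAL = {
    "3812": {
        # Placeholder wording, not the text of the actual manual entry.
        "explanation": "Z translator failure",
        # (location, type) pairs in replacement order; only the first
        # pack is named in the text.
        "packs": [("10-030-46", "A403")],
    },
}


def manual_key(block, segment):
    """Join block and segment numbers into a manual key such as '3812'.

    Assumption: each number is written as two decimal digits.
    """
    return f"{block:02d}{segment:02d}"


def replacement_packs(block, segment):
    """Return the listed circuit packs, or an empty list if no entry exists."""
    entry = TROUBLE_LOCATING_MANUAL.get(manual_key(block, segment))
    return [] if entry is None else list(entry["packs"])
```

With this table, `replacement_packs(38, 12)` returns `[("10-030-46", "A403")]`.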

8.3 Repair 

After obtaining the list of replacement circuit packs, the craftsman 
next requests continuous execution of the failing test block by
teletypewriter:

    MR CU:DGN:38!    Repetitive Control Unit Diagnostic request for block 38.

This request should produce a verification that the failure still 
exists by producing a teletypewriter printout identical to that originally 
obtained and should cause the fail light on the maintenance center 
panel to come on. This fail light is turned on by program each time 
the diagnostic fails. The maintenance center panel pass light is turned 
on if the test passes. 

Circuit packs are replaced with this request running in the active 
control unit. Of course, off-line power must be removed while a circuit 
pack is being replaced. In this case, almost immediately after the 
first circuit pack (10-030-46) is replaced and power is restored, the 
fail light should go out and the pass light should come on indicating 
the trouble has been fixed. 
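
The interaction between the repetitive request and the panel lights can be modeled as below. This is a simplified sketch, not the system program. The class and function names are hypothetical, and treating the pass and fail lights as mutually exclusive is an assumption based on the behavior described above.

```python
class MaintenancePanel:
    """Minimal model of the pass and fail lights on the maintenance center panel."""

    def __init__(self):
        self.fail_light = False
        self.pass_light = False

    def record(self, passed):
        # Assumption: each diagnostic result lights one lamp and clears the other.
        self.pass_light = passed
        self.fail_light = not passed


def run_repetitive_diagnostic(diagnostic, panel, iterations):
    """Run the diagnostic repeatedly, updating the panel after each pass.

    diagnostic is any zero-argument callable returning True on success.
    Returns the list of results, one per iteration.
    """
    history = []
    for _ in range(iterations):
        passed = diagnostic()
        panel.record(passed)
        history.append(passed)
    return history
```

In use, the diagnostic callable would report failure until the faulty pack is replaced. The next iteration then passes, the fail light goes out, and the pass light comes on.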

After the fault has been repaired, the craftsman can now type 
teletypewriter requests to remove the diagnostic test and to restore the 
off-line to service: 

    M SY:CLR!    Clear out repetitive request
    M CU:RST!    Restore off-line Control Unit to service

These requests should turn off the out-of-service light on the main- 
tenance center panel and put the two control units back in syn- 
chronism, after first successfully completing another test of the standby 
control unit and its access to master scanner 3 controller 0. 

The craftsman may now return to the original problem in con- 
troller 1 of line trunk network 2 by again requesting a repetitive net- 
work order. 